Untitled document (4)

The document consists of a series of questions and answers related to data science concepts, methodologies, and tools. Key topics include knowledge extraction from big datasets, data types, the data science process, and various algorithms and techniques used in machine learning and data visualization. It also covers specific tools and libraries relevant to data science, such as Hadoop, JSON, and TensorFlow.


1. What is the primary goal of extracting knowledge from big datasets?

Answer: b) Pattern recognition for decision-making


Explanation:

a) Data storage optimization – Not the main goal, though efficient storage is important.

b) Pattern recognition for decision-making – Correct. Data science aims to discover patterns to
guide strategic actions.

c) Hardware maintenance – Irrelevant to data analysis.

d) Network security – A separate domain, though it can use data science techniques.

e) Software licensing – Not related to extracting knowledge from data.

---

2. Which data type is characterized by a fixed schema?

Answer: c) Structured
Explanation:

a) Unstructured – No predefined format (e.g., text, video).

b) Semi-structured – Has some structure (e.g., XML, JSON) but not rigid.

c) Structured – Correct. Follows a fixed schema like relational databases.

d) Temporal – Relates to time-based data, not schema.

e) Geospatial – Data with location info; structure varies.

---

3. The data science process step that involves handling outliers occurs during:

Answer: b) Pre-processing
Explanation:

a) Data collection – Gathering raw data, not modifying it.

b) Pre-processing – Correct. Cleaning data, handling outliers and missing values.

c) Training – Applies models to cleaned data.

d) Deployment – Putting models into production.

e) Visualization – Shows data, doesn't modify it.

---

4. Which tool is specifically designed for distributed data processing?

Answer: b) Hadoop
Explanation:

a) Excel – For local spreadsheet tasks.

b) Hadoop – Correct. Distributed storage and processing of large datasets.

c) Photoshop – For image editing.

d) PowerPoint – For presentations.

e) WordPress – Web content management.

---

5. A box plot is most effective for displaying:

Answer: b) Data distributions


Explanation:

a) Time-series trends – Better shown with line charts.

b) Data distributions – Correct. Shows median, quartiles, and outliers.


c) Geographic patterns – Better shown with maps or GIS visualizations.

d) Text sentiment – Bar charts or word clouds are better.

e) Audio waveforms – Use waveform or spectrogram plots.
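The quantities a box plot displays can be computed directly. A minimal sketch with the standard library, using a made-up sample (the `statistics.quantiles` `method="inclusive"` setting matches the usual linear-interpolation quartiles):

```python
import statistics

# A box plot summarizes a distribution by median, quartiles, and outliers.
data = sorted([2, 4, 4, 5, 6, 7, 8, 9, 12, 30])

median = statistics.median(data)
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers.
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(median, q1, q3, outliers)
```

Here the value 30 falls above the upper fence and would be drawn as an outlier point beyond the whisker.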

---

6. In probability theory, the Poisson distribution is typically used to model:

Answer: c) Rare event counts


Explanation:

a) Continuous measurements – Use normal or exponential distributions.

b) Binary outcomes – Use Bernoulli or Binomial.

c) Rare event counts – Correct. Like calls per hour or website hits.

d) Circular data – Modeled with von Mises distribution.

e) Categorical variables – Use Multinomial distribution.
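The Poisson probability mass function is easy to evaluate by hand. A small sketch with an illustrative rate (3 calls per hour is an assumed example, not from the source):

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson distribution with rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# e.g. probability of exactly 2 support calls in an hour when the
# average rate is 3 calls per hour:
p = poisson_pmf(2, 3.0)
print(round(p, 4))
```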

---

7. Which phase of the data science process includes feature engineering?

Answer: b) Pre-processing
Explanation:

a) Data collection – Gathering data, no modifications.

b) Pre-processing – Correct. Includes feature creation/modification.

c) Model deployment – Using the model in production.

d) Business reporting – Presenting results, not transforming features.

e) Hardware setup – Infrastructure, not data handling.

---
8. The term "ETL" in data management refers to:

Answer: b) Extract, Transform, Load


Explanation:

a) Error Tracking Logs – Unrelated.

b) Extract, Transform, Load – Correct. Pipeline to prepare data for analysis.

c) Embedded Testing Layers – Not applicable.

d) Encryption Transfer Logic – Not a standard term.

e) External Tagging Libraries – Doesn't describe data pipelines.
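The three ETL stages can be sketched in a few lines. This toy pipeline uses an in-memory CSV and a plain list as the "warehouse"; the data and field names are invented for illustration:

```python
import csv
import io

# Extract: read raw rows from a CSV source (an in-memory file here).
raw = io.StringIO("name,salary\nAda,1000\nBob,\nCarl,1500\n")
rows = list(csv.DictReader(raw))

# Transform: drop incomplete records and convert types.
clean = [
    {"name": r["name"], "salary": int(r["salary"])}
    for r in rows
    if r["salary"]
]

# Load: write the cleaned records to a target (a list standing in
# for a database table).
warehouse = []
warehouse.extend(clean)
print(warehouse)
```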

---

9. Which is a key characteristic of unstructured data?

Answer: c) No predefined format


Explanation:

a) Fits relational databases – That's structured data.

b) Contains metadata tags – More common in semi-structured data.

c) No predefined format – Correct. Text, audio, video, etc.

d) Easy to query with SQL – SQL works with structured data.

e) Always numerical – Not true; can be text, images, etc.

---

10. In the CRISP-DM methodology, which phase follows business understanding?

Answer: e) Data understanding


Explanation:

a) Data preparation – Happens after understanding data.


b) Modeling – Later step.

c) Evaluation – Near the end.

d) Deployment – Final stage.

e) Data understanding – Correct. Second phase, after defining business goals.

---

21. In regression, multicollinearity refers to:

Answer: b) High correlation among independent variables

a) Outliers in data – Different issue.

b) Correct. It causes instability in coefficient estimates.

c) Non-linear relationships – Not relevant here.

d) Missing values – Separate problem.

e) Response variable variability – Not multicollinearity.
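Multicollinearity can be spotted by checking correlations between predictors. A minimal sketch with made-up data where one predictor is nearly a multiple of the other:

```python
# Multicollinearity: two predictors carrying nearly the same information.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 4.0, 6.2, 7.9, 10.1]  # roughly 2 * x1

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

r = pearson(x1, x2)
print(round(r, 3))  # near 1.0 -> near-duplicate predictors
```

In practice the variance inflation factor (VIF) is the more thorough diagnostic, since it also catches a predictor that is a combination of several others.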

---

22. The term “dimensionality reduction” refers to:

Answer: c) Reducing number of features

a) Normalizing data – Scaling.

b) Reducing file size – Possible side effect.

c) Correct. It simplifies models and reduces overfitting.

d) Increasing model complexity – Opposite goal.

e) Reducing training time – A benefit, not the definition.

---
23. Which of the following is a supervised learning task?

Answer: a) Classification

a) Correct. Labeled data is used to assign categories.

b) Clustering – Unsupervised.

c) Dimensionality reduction – Often unsupervised.

d) Association rule mining – Unsupervised.

e) Topic modeling – Unsupervised.

---

24. What does overfitting refer to in machine learning?

Answer: c) Model fits training data too well

a) Too few features – May cause underfitting.

b) Low training accuracy – Actually sign of underfitting.

c) Correct. Overfitted models perform poorly on new data.

d) High generalization – This is desired, not a problem.

e) Limited data – Can contribute, but not definition.

---

25. Which data visualization is best for showing parts of a whole?

Answer: b) Pie chart

a) Scatter plot – Relationships.

b) Correct. Pie charts show proportions.


c) Histogram – Distribution.

d) Box plot – Quartiles and spread.

e) Line graph – Time trends.

---

26. Which is NOT a component of the data science Venn diagram (Drew Conway)?

Answer: e) Graphic design

a) Hacking skills

b) Math/stat knowledge

c) Substantive expertise

d) All of the above are included

e) Correct. Graphic design is not part of the core Venn.

---

27. Which database system is NoSQL?

Answer: d) MongoDB

a) MySQL – SQL.

b) PostgreSQL – SQL.

c) Oracle – SQL.

d) Correct. MongoDB stores JSON-like documents.

e) SQLite – SQL-based.
---

28. Which data format is commonly used for data interchange?

Answer: b) JSON

a) DOCX – Document format.

b) Correct. JSON is lightweight, widely used in APIs.

c) EXE – Executable file.

d) MP3 – Audio.

e) PNG – Image.
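JSON interchange in Python is a round trip through the standard `json` module. The record below is made up for illustration:

```python
import json

# Round-trip: serialize a Python record to a JSON string, then parse
# it back — the same shape an API would send and receive.
record = {"id": 7, "name": "sensor-A", "readings": [1.5, 2.0, 2.5]}

text = json.dumps(record)    # dict -> JSON string
parsed = json.loads(text)    # JSON string -> dict

print(text)
```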

---

29. A confusion matrix is used in:

Answer: c) Classification

a) Clustering – Different metrics.

b) Regression – Use RMSE, R².

c) Correct. It shows true/false positives/negatives.

d) Time series – Use forecasting metrics.

e) Dimensionality reduction – Not applicable.
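The four cells of a binary confusion matrix can be counted directly from label lists. A sketch with invented labels:

```python
# Build a 2x2 confusion matrix from true vs. predicted binary labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```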

---

30. Which is a common performance metric for regression?

Answer: d) RMSE

a) Accuracy – For classification.


b) Precision – Classification.

c) Recall – Classification.

d) Correct. RMSE = root mean squared error.

e) F1 Score – Classification.
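RMSE is a one-liner once the squared errors are in hand. A sketch with made-up predictions:

```python
import math

# RMSE penalizes large errors quadratically.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 6.0]

mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
rmse = math.sqrt(mse)
print(rmse)
```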

---

31. A benefit of using cloud platforms for data science is:

Answer: b) Scalability

a) Hardware replacement – Not a direct benefit.

b) Correct. Cloud offers flexible compute/storage.

c) Offline data access – Often limited.

d) Guaranteed model accuracy – Depends on model, not platform.

e) Source code encryption – Not a main feature.

---

32. Which is an example of categorical data?

Answer: c) Eye color

a) Height – Numerical.

b) Age – Numerical.

c) Correct. Categories like blue, green.

d) Temperature – Continuous.

e) Weight – Continuous.
---

33. In k-means clustering, ‘k’ refers to:

Answer: b) Number of clusters

a) Features – Columns.

b) Correct. It defines how many groups to form.

c) Data points – Rows.

d) Iterations – Not related directly.

e) Distance metric – Used internally.
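The role of k is easiest to see in a toy run. A minimal 1-D k-means sketch with k = 2 and deterministic initial centroids; real libraries such as scikit-learn additionally handle initialization strategies, convergence checks, and higher dimensions:

```python
# Minimal k-means on 1-D data: k = 2 groups, fixed starting centroids.
points = [1.0, 1.5, 2.0, 10.0, 11.0, 12.5]
centroids = [points[0], points[-1]]  # deterministic start for the sketch

for _ in range(10):  # a few assignment/update rounds
    # Assignment step: each point joins its nearest centroid's cluster.
    clusters = [[], []]
    for x in points:
        nearest = min(range(2), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    # Update step: each centroid moves to its cluster's mean.
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)
```

With k = 2 the algorithm settles on one centroid per visible group of points; choosing a different k would force a different number of groups regardless of the data's natural structure.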

---

34. Which library is used for data visualization in Python?

Answer: b) Matplotlib

a) NumPy – Numerical ops.

b) Correct. Basic plotting library.

c) Scikit-learn – Machine learning.

d) TensorFlow – Deep learning.

e) Flask – Web apps.

---

35. Which of the following is an ensemble method?

Answer: c) Random Forest


a) Naive Bayes – Single model.

b) KNN – Lazy learner.

c) Correct. Random Forest uses many trees.

d) Logistic regression – Individual model.

e) PCA – Dimensionality reduction.

---

36. Which is NOT a type of machine learning?

Answer: e) Exhaustive learning

a) Supervised – Yes.

b) Unsupervised – Yes.

c) Reinforcement – Yes.

d) Semi-supervised – Yes.

e) Correct. Exhaustive learning is not a real type.

---

37. A data lake stores:

Answer: b) Raw, unprocessed data

a) Only structured data – False.

b) Correct. Data lakes store raw data in any format.

c) Only cleaned data – That’s a warehouse.

d) Tabular data only – False.


e) Secure documents – May include, but not purpose.

---

38. Which tool helps automate machine learning workflows?

Answer: b) AutoML

a) Excel – Manual.

b) Correct. AutoML automates model building/tuning.

c) MS Word – Not for ML.

d) Tableau – Visualization.

e) Hadoop – Processing platform.

---

39. Which library supports numerical computing in Python?

Answer: a) NumPy

a) Correct. Core numerical array ops.

b) Flask – Web.

c) Seaborn – Visualization.

d) Pandas – Dataframes.

e) PyTorch – Deep learning.

---

40. What is the role of the activation function in neural networks?


Answer: c) Introduces non-linearity

a) Reduces overfitting – Use regularization.

b) Initializes weights – Done separately.

c) Correct. Allows complex patterns.

d) Normalizes outputs – Use batch normalization for that.

e) Increases dataset size – Not related.
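Two common activation functions make the non-linearity concrete:

```python
import math

# Without an activation function, stacked linear layers collapse into a
# single linear map; a non-linearity such as ReLU or sigmoid breaks that.
def relu(x: float) -> float:
    return max(0.0, x)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

print(relu(-2.0), relu(3.0))   # negative inputs clipped to 0
print(sigmoid(0.0))            # squashes any input into (0, 1)
```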

---

41. The bias-variance tradeoff refers to:

Answer: b) Model generalization

a) Ethical considerations – No.

b) Correct. Balance between underfitting (high bias) and overfitting (high variance).

c) Data storage costs – No relation.

d) Data collection methods – No.

e) Visualization clarity – No.

---

42. Which technique helps address class imbalance?

Answer: a) SMOTE

a) Correct. Synthetic Minority Over-sampling Technique generates synthetic samples.

b) Normalization – Scaling data, no class balance.

c) One-hot encoding – Encoding categorical vars.

d) Tokenization – NLP text splitting.


e) Aggregation – Summarizing data.

---

43. In NLP, stop word removal is an example of:

Answer: a) Data cleaning

a) Correct. Stop words are common words removed to reduce noise.

b) Model training – Not relevant here.

c) Visualization – No.

d) Deployment – No.

e) Hardware setup – No.
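A stop word filter is a simple token-level pass. The stop word set below is a tiny illustrative subset, not the list a library like NLTK ships:

```python
# Stop word removal: drop high-frequency function words before analysis.
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}

def remove_stop_words(text: str) -> list[str]:
    return [w for w in text.lower().split() if w not in STOP_WORDS]

tokens = remove_stop_words("The goal of NLP is to understand text")
print(tokens)
```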

---

44. Which cloud service provides BigQuery analytics?

Answer: b) Google Cloud

a) AWS – Has Redshift.

b) Correct. BigQuery is Google Cloud’s data warehouse.

c) Azure – Has Synapse.

d) IBM Cloud – Different services.

e) Oracle – Different platform.

---

45. The primary goal of dimensionality reduction is to:


Answer: b) Simplify datasets

a) Increase data volume – No.

b) Correct. Reduce features to improve models and speed.

c) Create backups – No.

d) Improve storage – Side effect.

e) Encrypt data – No.

---

46. Which algorithm uses gradient boosting?

Answer: b) XGBoost

a) KNN – No boosting.

b) Correct. XGBoost is a gradient boosting framework.

c) K-means – Clustering.

d) SVM – Margin based.

e) Apriori – Association rules.

---

47. In data visualization, the “data-ink ratio” emphasizes:

Answer: b) Informative elements

a) Color variety – No.

b) Correct. Ratio of ink used for data vs. decoration.

c) 3D effects – Usually reduce the data-ink ratio.


d) Animation – No.

e) Font styles – No.

---

48. Which phase converts raw data into analysis-ready format?

Answer: b) Preprocessing

a) Collection – Gathering raw data.

b) Correct. Cleaning, transforming data.

c) Modeling – Building models.

d) Deployment – Putting models into use.

e) Monitoring – Tracking model performance.

---

49. The term “MapReduce” is associated with:

Answer: b) Distributed processing

a) Data visualization – No.

b) Correct. It’s a programming model for distributed computing.

c) Neural networks – No.

d) Statistical testing – No.

e) Database design – No.
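The classic MapReduce example is word counting. A single-process sketch of the three phases, with made-up documents; a real framework such as Hadoop runs the same phases across many machines:

```python
from itertools import groupby
from operator import itemgetter

docs = ["big data big ideas", "data science"]

# Map phase: emit a (word, 1) pair for every word occurrence.
pairs = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: bring pairs with the same key together.
pairs.sort(key=itemgetter(0))

# Reduce phase: sum the counts within each key's group.
counts = {
    word: sum(n for _, n in group)
    for word, group in groupby(pairs, key=itemgetter(0))
}
print(counts)
```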

---
50. Which is NOT a characteristic of big data?

Answer: d) Validity

a) Volume – Yes.

b) Velocity – Yes.

c) Variety – Yes.

d) Correct. Validity is not a standard “V” of big data.

e) Veracity – Yes.

---

51. In computer vision, YOLO is used for:

Answer: b) Object detection

a) Image classification – No.

b) Correct. YOLO (You Only Look Once) detects objects in images.

c) Style transfer – No.

d) Data augmentation – No.

e) Noise reduction – No.

---

52. Which tool is used for containerization in deployment?

Answer: a) Docker

a) Correct. Docker packages applications in containers.

b) Tableau – Visualization.
c) Excel – Spreadsheet.

d) Hadoop – Big data processing.

e) Spark – Big data processing.

---

53. The F1-score combines which two metrics?

Answer: b) Precision & Recall

a) Accuracy & Precision – No.

b) Correct. Harmonic mean of precision and recall.

c) Recall & Specificity – No.

d) RMSE & MAE – Regression metrics.

e) Variance & Covariance – Statistical metrics.
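The harmonic-mean formula is short enough to verify by hand. Counts below are invented for illustration:

```python
# F1 = harmonic mean of precision and recall.
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)   # 8 / 10
recall = tp / (tp + fn)      # 8 / 12
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))
```

Because it is a harmonic mean, F1 stays low unless precision and recall are both reasonably high; a model cannot hide a weak recall behind a strong precision.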

---

54. Which data science role focuses on data pipelines?

Answer: b) Data Engineer

a) Business Analyst – Focuses on analysis, not pipelines.

b) Correct. Builds and maintains data workflows.

c) UX Designer – Interface design.

d) Network Admin – Infrastructure.

e) Security Specialist – Security.


---

55. The term “shuffling” in data processing refers to:

Answer: b) Randomizing data order

a) Data encryption – No.

b) Correct. Helps prevent order bias during training.

c) Deleting duplicates – Different.

d) Compression – No.

e) Visualization – No.

---

56. Which algorithm is unsupervised?

Answer: b) K-means

a) Linear Regression – Supervised.

b) Correct. K-means clusters unlabeled data.

c) SVM – Supervised.

d) Logistic Regression – Supervised.

e) Decision Trees – Supervised.

---

57. In time-series analysis, autocorrelation measures:

Answer: b) Relationship with past values

a) Data storage needs – No.


b) Correct. Correlation between time-lagged values.

c) Missing data – No.

d) Feature importance – No.

e) Model complexity – No.
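Lag-1 autocorrelation can be computed directly from the series. A sketch with a made-up series:

```python
# Lag-1 autocorrelation: correlation of a series with itself shifted
# by one time step.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0]

n = len(series)
mean = sum(series) / n
var = sum((x - mean) ** 2 for x in series)

lag = 1
acf1 = sum(
    (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
) / var
print(acf1)
```

A value near +1 means neighboring observations move together (as in this rising-then-falling series); a value near 0 means past values carry little information about the next one.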

---

58. Which Python library is used for deep learning?

Answer: c) TensorFlow

a) Pandas – Dataframes.

b) NumPy – Numerical.

c) Correct. TensorFlow is for deep learning.

d) Matplotlib – Visualization.

e) Scikit-learn – ML, but not deep learning focused.

---

59. The term “overfitting” occurs when a model:

Answer: a) Performs well on training data only

a) Correct. Model fails to generalize.

b) Generalizes to new data – Opposite.

c) Ignores important features – No.

d) Runs too slowly – No.

e) Requires less memory – No.


---

60. Which technique creates synthetic data for testing?

Answer: a) Bootstrapping

a) Correct. Resampling technique to create synthetic samples.

b) Imputation – Fills missing data.

c) Normalization – Scaling.

d) Tokenization – Text splitting.

e) Encryption – Data security.
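Bootstrapping means resampling the observed data with replacement. A seeded sketch (the data and the 1000-resample count are illustrative choices) that estimates the sampling variability of the mean:

```python
import random

# Bootstrapping: draw resamples with replacement to create synthetic
# samples, here used to approximate the distribution of the mean.
random.seed(42)  # fixed seed so the sketch is reproducible
data = [4, 8, 6, 5, 3, 7, 9, 5]

boot_means = []
for _ in range(1000):
    sample = random.choices(data, k=len(data))  # draw with replacement
    boot_means.append(sum(sample) / len(sample))

estimate = sum(boot_means) / len(boot_means)
print(round(estimate, 2))  # should land near the observed mean of 5.875
```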
