STATISTICS

For each textbook chapter shown in the course outline, you will complete the following problems.

Ch. 1

9, 18, 27, 36, 45, 54, 63, 72, 81, 90

Ch. 2

9, 18, 27, 36, 45, 54, 63, 72, 81, 90, 99, 108, 117

Ch. 3

9, 18, 27, 36, 45, 54, 63, 72, 81, 90, 99, 108, 117, 126

Ch. 4

5, 10, 15, 20, 25, 30, 35, 40, 69, 70, 71, 72, 73, 74

Introductory Statistics

SENIOR CONTRIBUTING AUTHORS BARBARA ILLOWSKY, DE ANZA COLLEGE SUSAN DEAN, DE ANZA COLLEGE

 

 

 

 

OpenStax Rice University 6100 Main Street MS-375 Houston, Texas 77005 To learn more about OpenStax, visit https://openstax.org. Individual print copies and bulk orders can be purchased through our website. ©2018 Rice University. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Under this license, any user of this textbook or the textbook contents herein must provide proper attribution as follows:

– If you redistribute this textbook in a digital format (including but not limited to PDF and HTML), then you must retain on every page the following attribution: “Download for free at https://openstax.org/details/books/introductory-statistics.”

– If you redistribute this textbook in a print format, then you must include on every physical page the following attribution: “Download for free at https://openstax.org/details/books/introductory-statistics.”

– If you redistribute part of this textbook, then you must retain in every digital format page view (including but not limited to PDF and HTML) and on every physical printed page the following attribution: “Download for free at https://openstax.org/details/books/introductory-statistics.”

– If you use this textbook as a bibliographic reference, please include https://openstax.org/details/books/introductory-statistics in your citation.

For questions regarding this licensing, please contact support@openstax.org.

Trademarks

The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, OpenStax CNX logo, OpenStax Tutor name, OpenStax Tutor logo, Connexions name, Connexions logo, Rice University name, and Rice University logo are not subject to the license and may not be reproduced without the prior and express written consent of Rice University.

PRINT BOOK ISBN-10 1-938168-20-8
PRINT BOOK ISBN-13 978-1-938168-20-8
PDF VERSION ISBN-10 1-947172-05-0
PDF VERSION ISBN-13 978-1-947172-05-0
ENHANCED TEXTBOOK PART 1 ISBN-10 1-938168-29-1
ENHANCED TEXTBOOK PART 1 ISBN-13 978-1-938168-29-1
Revision Number ST-2013-002(03/18)-LC
Original Publication Year 2013

 

 

 

 

OPENSTAX

OpenStax provides free, peer-reviewed, openly licensed textbooks for introductory college and Advanced Placement® courses and low-cost, personalized courseware that helps students learn. A nonprofit ed tech initiative based at Rice University, we’re committed to helping students access the tools they need to complete their courses and meet their educational goals.

RICE UNIVERSITY

OpenStax, OpenStax CNX, and OpenStax Tutor are initiatives of Rice University. As a leading research university with a distinctive commitment to undergraduate education, Rice University aspires to path-breaking research, unsurpassed teaching, and contributions to the betterment of our world. It seeks to fulfill this mission by cultivating a diverse community of learning and discovery that produces leaders across the spectrum of human endeavor.

 

FOUNDATION SUPPORT

OpenStax is grateful for the tremendous support of our sponsors. Without their strong engagement, the goal of free access to high-quality textbooks would remain just a dream.

Laura and John Arnold Foundation (LJAF) actively seeks opportunities to invest in organizations and thought leaders that have a sincere interest in implementing fundamental changes that not only yield immediate gains, but also repair broken systems for future generations. LJAF currently focuses its strategic investments on education, criminal justice, research integrity, and public accountability.

The William and Flora Hewlett Foundation has been making grants since 1967 to help solve social and environmental problems at home and around the world. The Foundation concentrates its resources on activities in education, the environment, global development and population, performing arts, and philanthropy, and makes grants to support disadvantaged communities in the San Francisco Bay Area.

Calvin K. Kazanjian was the founder and president of Peter Paul (Almond Joy), Inc. He firmly believed that the more people understood about basic economics the happier and more prosperous they would be. Accordingly, he established the Calvin K. Kazanjian Economics Foundation Inc, in 1949 as a philanthropic, nonpolitical educational organization to support efforts that enhanced economic understanding.

Guided by the belief that every life has equal value, the Bill & Melinda Gates Foundation works to help all people lead healthy, productive lives. In developing countries, it focuses on improving people’s health with vaccines and other life-saving tools and giving them the chance to lift themselves out of hunger and extreme poverty. In the United States, it seeks to significantly improve education so that all young people have the opportunity to reach their full potential. Based in Seattle, Washington, the foundation is led by CEO Jeff Raikes and Co-chair William H. Gates Sr., under the direction of Bill and Melinda Gates and Warren Buffett.

The Maxfield Foundation supports projects with potential for high impact in science, education, sustainability, and other areas of social importance.

Our mission at The Michelson 20MM Foundation is to grow access and success by eliminating unnecessary hurdles to affordability. We support the creation, sharing, and proliferation of more effective, more affordable educational content by leveraging disruptive technologies, open educational resources, and new models for collaboration between for-profit, nonprofit, and public entities.

The Bill and Stephanie Sick Fund supports innovative projects in the areas of Education, Art, Science and Engineering.

 

 

 

Access. The future of education.

OpenStax.org

I like free textbooks and I cannot lie.

Give $5 or more to OpenStax and we’ll send you a sticker! OpenStax is a nonprofit initiative, which means that every dollar you give helps us maintain and grow our library of free textbooks.

If you have a few dollars to spare, visit OpenStax.org/give to donate. We’ll send you an OpenStax sticker to thank you for your support!

 

 

Table of Contents

Preface

Chapter 1: Sampling and Data
1.1 Definitions of Statistics, Probability, and Key Terms
1.2 Data, Sampling, and Variation in Data and Sampling
1.3 Frequency, Frequency Tables, and Levels of Measurement
1.4 Experimental Design and Ethics
1.5 Data Collection Experiment
1.6 Sampling Experiment

Chapter 2: Descriptive Statistics
2.1 Stem-and-Leaf Graphs (Stemplots), Line Graphs, and Bar Graphs
2.2 Histograms, Frequency Polygons, and Time Series Graphs
2.3 Measures of the Location of the Data
2.4 Box Plots
2.5 Measures of the Center of the Data
2.6 Skewness and the Mean, Median, and Mode
2.7 Measures of the Spread of the Data
2.8 Descriptive Statistics

Chapter 3: Probability Topics
3.1 Terminology
3.2 Independent and Mutually Exclusive Events
3.3 Two Basic Rules of Probability
3.4 Contingency Tables
3.5 Tree and Venn Diagrams
3.6 Probability Topics

Chapter 4: Discrete Random Variables
4.1 Probability Distribution Function (PDF) for a Discrete Random Variable
4.2 Mean or Expected Value and Standard Deviation
4.3 Binomial Distribution
4.4 Geometric Distribution
4.5 Hypergeometric Distribution
4.6 Poisson Distribution
4.7 Discrete Distribution (Playing Card Experiment)
4.8 Discrete Distribution (Lucky Dice Experiment)

Chapter 5: Continuous Random Variables
5.1 Continuous Probability Functions
5.2 The Uniform Distribution
5.3 The Exponential Distribution
5.4 Continuous Distribution

Chapter 6: The Normal Distribution
6.1 The Standard Normal Distribution
6.2 Using the Normal Distribution
6.3 Normal Distribution (Lap Times)
6.4 Normal Distribution (Pinkie Length)

Chapter 7: The Central Limit Theorem
7.1 The Central Limit Theorem for Sample Means (Averages)
7.2 The Central Limit Theorem for Sums
7.3 Using the Central Limit Theorem
7.4 Central Limit Theorem (Pocket Change)
7.5 Central Limit Theorem (Cookie Recipes)

Chapter 8: Confidence Intervals
8.1 A Single Population Mean using the Normal Distribution
8.2 A Single Population Mean using the Student t Distribution
8.3 A Population Proportion
8.4 Confidence Interval (Home Costs)
8.5 Confidence Interval (Place of Birth)
8.6 Confidence Interval (Women’s Heights)

Chapter 9: Hypothesis Testing with One Sample
9.1 Null and Alternative Hypotheses
9.2 Outcomes and the Type I and Type II Errors
9.3 Distribution Needed for Hypothesis Testing
9.4 Rare Events, the Sample, Decision and Conclusion
9.5 Additional Information and Full Hypothesis Test Examples
9.6 Hypothesis Testing of a Single Mean and Single Proportion

Chapter 10: Hypothesis Testing with Two Samples
10.1 Two Population Means with Unknown Standard Deviations
10.2 Two Population Means with Known Standard Deviations
10.3 Comparing Two Independent Population Proportions
10.4 Matched or Paired Samples
10.5 Hypothesis Testing for Two Means and Two Proportions

Chapter 11: The Chi-Square Distribution
11.1 Facts About the Chi-Square Distribution
11.2 Goodness-of-Fit Test
11.3 Test of Independence
11.4 Test for Homogeneity
11.5 Comparison of the Chi-Square Tests
11.6 Test of a Single Variance
11.7 Lab 1: Chi-Square Goodness-of-Fit
11.8 Lab 2: Chi-Square Test of Independence

Chapter 12: Linear Regression and Correlation
12.1 Linear Equations
12.2 Scatter Plots
12.3 The Regression Equation
12.4 Testing the Significance of the Correlation Coefficient
12.5 Prediction
12.6 Outliers
12.7 Regression (Distance from School)
12.8 Regression (Textbook Cost)
12.9 Regression (Fuel Efficiency)

Chapter 13: F Distribution and One-Way ANOVA
13.1 One-Way ANOVA
13.2 The F Distribution and the F-Ratio
13.3 Facts About the F Distribution
13.4 Test of Two Variances
13.5 Lab: One-Way ANOVA

Appendix A: Review Exercises (Ch 3-13)
Appendix B: Practice Tests (1-4) and Final Exams
Appendix C: Data Sets
Appendix D: Group and Partner Projects
Appendix E: Solution Sheets
Appendix F: Mathematical Phrases, Symbols, and Formulas
Appendix G: Notes for the TI-83, 83+, 84, 84+ Calculators
Appendix H: Tables
Index

This OpenStax book is available for free at http://cnx.org/content/col11562/1.18

 

 

PREFACE

Welcome to Introductory Statistics, an OpenStax resource. This textbook was written to increase student access to high-quality learning materials, maintaining highest standards of academic rigor at little to no cost.

The foundation of this textbook is Collaborative Statistics, by Barbara Illowsky and Susan Dean. Additional topics, examples, and innovations in terminology and practical applications have been added, all with a goal of increasing relevance and accessibility for students.

About OpenStax

OpenStax is a nonprofit based at Rice University, and it’s our mission to improve student access to education. Our first openly licensed college textbook was published in 2012, and our library has since scaled to over 25 books for college and AP® courses used by hundreds of thousands of students. OpenStax Tutor, our low-cost personalized learning tool, is being used in college courses throughout the country. Through our partnerships with philanthropic foundations and our alliance with other educational resource organizations, OpenStax is breaking down the most common barriers to learning and empowering students and instructors to succeed.

About OpenStax’s resources

Customization

Introductory Statistics is licensed under a Creative Commons Attribution 4.0 International (CC BY) license, which means that you can distribute, remix, and build upon the content, as long as you provide attribution to OpenStax and its content contributors.

Because our books are openly licensed, you are free to use the entire book or pick and choose the sections that are most relevant to the needs of your course. Feel free to remix the content by assigning your students certain chapters and sections in your syllabus, in the order that you prefer. You can even provide a direct link in your syllabus to the sections in the web view of your book.

Instructors also have the option of creating a customized version of their OpenStax book. The custom version can be made available to students in low-cost print or digital form through their campus bookstore. Visit your book page on OpenStax.org for more information.

Errata

All OpenStax textbooks undergo a rigorous review process. However, like any professional-grade textbook, errors sometimes occur. Since our books are web based, we can make updates periodically when deemed pedagogically necessary. If you have a correction to suggest, submit it through the link on your book page on OpenStax.org. Subject matter experts review all errata suggestions. OpenStax is committed to remaining transparent about all updates, so you will also find a list of past errata changes on your book page on OpenStax.org.

Format

You can access this textbook for free in web view or PDF through OpenStax.org, and in low-cost print and iBooks editions.

Coverage and scope

Chapter 1 Sampling and Data
Chapter 2 Descriptive Statistics
Chapter 3 Probability Topics
Chapter 4 Discrete Random Variables
Chapter 5 Continuous Random Variables
Chapter 6 The Normal Distribution
Chapter 7 The Central Limit Theorem
Chapter 8 Confidence Intervals
Chapter 9 Hypothesis Testing with One Sample
Chapter 10 Hypothesis Testing with Two Samples
Chapter 11 The Chi-Square Distribution
Chapter 12 Linear Regression and Correlation
Chapter 13 F Distribution and One-Way ANOVA


Alternate sequencing

Introductory Statistics was conceived and written to fit a particular topical sequence, but it can be used flexibly to accommodate other course structures. One such potential structure, which fits reasonably well with the textbook content, is provided below. Please consider, however, that the chapters were not written to be completely independent, and that the proposed alternate sequence should be carefully considered for student preparation and textual consistency.

Chapter 1 Sampling and Data
Chapter 2 Descriptive Statistics
Chapter 12 Linear Regression and Correlation
Chapter 3 Probability Topics
Chapter 4 Discrete Random Variables
Chapter 5 Continuous Random Variables
Chapter 6 The Normal Distribution
Chapter 7 The Central Limit Theorem
Chapter 8 Confidence Intervals
Chapter 9 Hypothesis Testing with One Sample
Chapter 10 Hypothesis Testing with Two Samples
Chapter 11 The Chi-Square Distribution
Chapter 13 F Distribution and One-Way ANOVA

Pedagogical foundation and features

• Examples are placed strategically throughout the text to show students the step-by-step process of interpreting and solving statistical problems. To keep the text relevant for students, the examples are drawn from a broad spectrum of practical topics, including examples about college life and learning, health and medicine, retail and business, and sports and entertainment.

• Try It practice problems immediately follow many examples and give students the opportunity to practice as they read the text. They are usually based on practical and familiar topics, like the Examples themselves.

• Collaborative Exercises provide an in-class scenario for students to work together to explore presented concepts.

• Using the TI-83, 83+, 84, 84+ Calculator shows students step-by-step instructions to input problems into their calculator.

• The Technology Icon indicates where the use of a TI calculator or computer software is recommended.

• Practice, Homework, and Bringing It Together problems give the students problems at various degrees of difficulty while also including real-world scenarios to engage students.

Statistics labs

These innovative activities were developed by Barbara Illowsky and Susan Dean in order to offer students the experience of designing, implementing, and interpreting statistical analyses. They are drawn from actual experiments and data-gathering processes and offer a unique hands-on and collaborative experience. The labs provide a foundation for further learning and classroom interaction that will produce a meaningful application of statistics.

Statistics Labs appear at the end of each chapter and begin with student learning outcomes, general estimates for time on task, and any global implementation notes. Students are then provided with step-by-step guidance, including sample data tables and calculation prompts. The detailed assistance will help the students successfully apply the concepts in the text and lay the groundwork for future collaborative or individual work.

Additional resources

Student and instructor resources

We’ve compiled additional resources for both students and instructors, including Getting Started Guides, an instructor solution manual, and PowerPoint slides. Instructor resources require a verified instructor account, which you can apply for when you log in or create your account on OpenStax.org. Take advantage of these resources to supplement your OpenStax book.

Community Hubs

OpenStax partners with the Institute for the Study of Knowledge Management in Education (ISKME) to offer Community Hubs on OER Commons – a platform for instructors to share community-created resources that support OpenStax books, free of charge. Through our Community Hubs, instructors can upload their own materials or download resources to use in their own courses, including additional ancillaries, teaching material, multimedia, and relevant course content. We encourage instructors to join the hubs for the subjects most relevant to your teaching and research as an opportunity both to enrich your courses and to engage with other faculty.

To reach the Community Hubs, visit www.oercommons.org/hubs/OpenStax.

Partner resources

OpenStax Partners are our allies in the mission to make high-quality learning materials affordable and accessible to students and instructors everywhere. Their tools integrate seamlessly with our OpenStax titles at a low cost. To access the partner resources for your text, visit your book page on OpenStax.org.

About the authors

Senior contributing authors

Barbara Illowsky, De Anza College
Susan Dean, De Anza College

Contributing authors

Birgit Aquilonius, West Valley College
Charles Ashbacher, Upper Iowa University, Cedar Rapids
Abraham Biggs, Broward Community College
Daniel Birmajer, Nazareth College
Roberta Bloom, De Anza College
Bryan Blount, Kentucky Wesleyan College
Ernest Bonat, Portland Community College
Sarah Boslaugh, Kennesaw State University
David Bosworth, Hutchinson Community College
Sheri Boyd, Rollins College
George Bratton, University of Central Arkansas
Jing Chang, College of Saint Mary
Laurel Chiappetta, University of Pittsburgh
Lenore Desilets, De Anza College
Matthew Einsohn, Prescott College
Ann Flanigan, Kapiolani Community College
David French, Tidewater Community College
Mo Geraghty, De Anza College
Larry Green, Lake Tahoe Community College
Michael Greenwich, College of Southern Nevada
Inna Grushko, De Anza College
Valier Hauber, De Anza College
Janice Hector, De Anza College
Jim Helmreich, Marist College
Robert Henderson, Stephen F. Austin State University
Mel Jacobsen, Snow College
Mary Jo Kane, De Anza College
Lynette Kenyon, Collin County Community College
Charles Klein, De Anza College
Alexander Kolovos
Sheldon Lee, Viterbo University
Sara Lenhart, Christopher Newport University
Wendy Lightheart, Lane Community College
Vladimir Logvenenko, De Anza College
Jim Lucas, De Anza College
Lisa Markus, De Anza College
Miriam Masullo, SUNY Purchase
Diane Mathios, De Anza College
Robert McDevitt, Germanna Community College
Mark Mills, Central College
Cindy Moss, Skyline College
Nydia Nelson, St. Petersburg College
Benjamin Ngwudike, Jackson State University
Jonathan Oaks, Macomb Community College
Carol Olmstead, De Anza College
Adam Pennell, Greensboro College
Kathy Plum, De Anza College
Lisa Rosenberg, Elon University
Sudipta Roy, Kankakee Community College
Javier Rueda, De Anza College
Yvonne Sandoval, Pima Community College
Rupinder Sekhon, De Anza College
Travis Short, St. Petersburg College
Frank Snow, De Anza College
Abdulhamid Sukar, Cameron University
Jeffery Taub, Maine Maritime Academy
Mary Teegarden, San Diego Mesa College
John Thomas, College of Lake County
Philip J. Verrecchia, York College of Pennsylvania
Dennis Walsh, Middle Tennessee State University
Cheryl Wartman, University of Prince Edward Island
Carol Weideman, St. Petersburg College
Andrew Wiesner, Pennsylvania State University


1 | SAMPLING AND DATA

Figure 1.1 We encounter statistics in our daily lives more often than we probably realize and from many different sources, like the news. (credit: David Sim)

Introduction

Chapter Objectives

By the end of this chapter, the student should be able to:

• Recognize and differentiate between key terms.

• Apply various types of sampling methods to data collection.

• Create and interpret frequency tables.

You are probably asking yourself the question, “When and where will I use statistics?” If you read any newspaper, watch television, or use the Internet, you will see statistical information. There are statistics about crime, sports, education, politics, and real estate. Typically, when you read a newspaper article or watch a television news program, you are given sample information. With this information, you may make a decision about the correctness of a statement, claim, or “fact.” Statistical methods can help you make the “best educated guess.”

Since you will undoubtedly be given statistical information at some point in your life, you need to know some techniques for analyzing the information thoughtfully. Think about buying a house or managing a budget. Think about your chosen profession. The fields of economics, business, psychology, education, biology, law, computer science, police science, and early childhood development require at least one course in statistics.

Included in this chapter are the basic ideas and words of probability and statistics. You will soon understand that statistics and probability work together. You will also learn how data are gathered and how “good” data can be distinguished from “bad.”

1.1 | Definitions of Statistics, Probability, and Key Terms

The science of statistics deals with the collection, analysis, interpretation, and presentation of data. We see and use data in our everyday lives.

In your classroom, try this exercise. Have class members write down the average time (in hours, to the nearest half-hour) they sleep per night. Your instructor will record the data. Then create a simple graph (called a dot plot) of the data. A dot plot consists of a number line and dots (or points) positioned above the number line. For example, consider the following data:

5; 5.5; 6; 6; 6; 6.5; 6.5; 6.5; 6.5; 7; 7; 8; 8; 9

The dot plot for this data would be as follows:

Figure 1.2
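Because the figure itself is not reproduced here, the following is a minimal sketch of how a dot plot of the sleep data could be drawn as text. The function name `text_dot_plot` and the sideways layout (number line running down the page, one `*` per observation) are illustrative choices, not part of the textbook.

```python
from collections import Counter

# Sleep times in hours, copied from the example data above.
sleep_hours = [5, 5.5, 6, 6, 6, 6.5, 6.5, 6.5, 6.5, 7, 7, 8, 8, 9]

def text_dot_plot(data, step=0.5):
    """Return one text row per position on the number line.

    Each row shows the value followed by one '*' per observation.
    The half-hour step is an assumption that matches this data set.
    """
    counts = Counter(data)
    low, high = min(data), max(data)
    n_steps = int(round((high - low) / step))
    rows = []
    for i in range(n_steps + 1):
        value = low + i * step
        rows.append(f"{value:>4.1f} | " + "*" * counts.get(value, 0))
    return rows

rows = text_dot_plot(sleep_hours)
print("\n".join(rows))
```

The row for 6.5 hours carries the most marks, which is one way to see where the data cluster.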

Does your dot plot look the same as or different from the example? Why? If you did the same example in an English class with the same number of students, do you think the results would be the same? Why or why not?

Where do your data appear to cluster? How might you interpret the clustering?

The questions above ask you to analyze and interpret your data. With this example, you have begun your study of statistics.

In this course, you will learn how to organize and summarize data. Organizing and summarizing data is called descriptive statistics. Two ways to summarize data are by graphing and by using numbers (for example, finding an average). After you have studied probability and probability distributions, you will use formal methods for drawing conclusions from “good” data. The formal methods are called inferential statistics. Statistical inference uses probability to determine how confident we can be that our conclusions are correct.
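As a small illustration of summarizing data with numbers, the sketch below computes a few descriptive statistics for the sleep data from the dot plot example. It uses Python's standard `statistics` module; the variable names are illustrative, and the measures themselves are treated properly in Chapter 2.

```python
import statistics

# Sleep times in hours, copied from the dot plot example.
sleep_hours = [5, 5.5, 6, 6, 6, 6.5, 6.5, 6.5, 6.5, 7, 7, 8, 8, 9]

mean_sleep = statistics.mean(sleep_hours)      # arithmetic average
median_sleep = statistics.median(sleep_hours)  # middle value of the sorted data
mode_sleep = statistics.mode(sleep_hours)      # most frequent value

print(f"mean   = {mean_sleep:.2f} hours")
print(f"median = {median_sleep} hours")
print(f"mode   = {mode_sleep} hours")
```

These three numbers describe the sample only. Saying something about all students, rather than this one class, is the job of inferential statistics.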

Effective interpretation of data (inference) is based on good procedures for producing data and thoughtful examination of the data. You will encounter what will seem to be too many mathematical formulas for interpreting data. The goal of statistics is not to perform numerous calculations using the formulas, but to gain an understanding of your data. The calculations can be done using a calculator or a computer. The understanding must come from you. If you can thoroughly grasp the basics of statistics, you can be more confident in the decisions you make in life.

Probability

Probability is a mathematical tool used to study randomness. It deals with the chance (the likelihood) of an event occurring. For example, if you toss a fair coin four times, the outcomes may not be two heads and two tails. However, if you toss the same coin 4,000 times, the outcomes will be close to half heads and half tails. The expected theoretical probability of heads in any one toss is 1/2 or 0.5. Even though the outcomes of a few repetitions are uncertain, there is a regular pattern of outcomes when there are many repetitions. After reading about the English statistician Karl Pearson, who tossed a coin 24,000 times with a result of 12,012 heads, one of the authors tossed a coin 2,000 times. The results were 996 heads. The fraction 996/2000 is equal to 0.498, which is very close to 0.5, the expected probability.
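The long-run pattern described above can be explored with a short simulation. The following is a minimal Python sketch, not part of the original text; the function name `proportion_heads` and the seed value are illustrative assumptions, and a pseudorandom generator stands in for a real coin.

```python
import random

def proportion_heads(num_tosses, seed=None):
    """Simulate tossing a fair coin and return the fraction of heads."""
    rng = random.Random(seed)  # the seed only makes the run repeatable
    heads = sum(1 for _ in range(num_tosses) if rng.random() < 0.5)
    return heads / num_tosses

# A few tosses can stray far from 0.5; many tosses tend to land close to it.
for n in (4, 4000):
    print(n, "tosses:", proportion_heads(n, seed=1))

# The author's result from the text: 996 heads in 2,000 tosses.
print(996 / 2000)
```

Running the loop with different seeds should show the four-toss proportion jumping around while the 4,000-toss proportion stays near 0.5.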

The theory of probability began with the study of games of chance such as poker. Predictions take the form of probabilities. To predict the likelihood of an earthquake, of rain, or whether you will get an A in this course, we use probabilities. Doctors use probability to determine the chance of a vaccination causing the disease the vaccination is supposed to prevent. A stockbroker uses probability to determine the rate of return on a client’s investments. You might use probability to decide to buy a lottery ticket or not. In your study of statistics, you will use the power of mathematics through probability calculations to analyze and interpret your data.

This OpenStax book is available for free at http://cnx.org/content/col11562/1.18

Key Terms

In statistics, we generally want to study a population. You can think of a population as a collection of persons, things, or objects under study. To study the population, we select a sample. The idea of sampling is to select a portion (or subset) of the larger population and study that portion (the sample) to gain information about the population. Data are the result of sampling from a population.

Because it takes a lot of time and money to examine an entire population, sampling is a very practical technique. If you wished to compute the overall grade point average at your school, it would make sense to select a sample of students who attend the school. The data collected from the sample would be the students’ grade point averages. In presidential elections, opinion poll samples of 1,000–2,000 people are taken. The opinion poll is supposed to represent the views of the people in the entire country. Manufacturers of canned carbonated drinks take samples to determine if a 16 ounce can contains 16 ounces of carbonated drink.

From the sample data, we can calculate a statistic. A statistic is a number that represents a property of the sample. For example, if we consider one math class to be a sample of the population of all math classes, then the average number of points earned by students in that one math class at the end of the term is an example of a statistic. The statistic is an estimate of a population parameter. A parameter is a numerical characteristic of the whole population that can be estimated by a statistic. Since we considered all math classes to be the population, then the average number of points earned per student over all the math classes is an example of a parameter.

One of the main concerns in the field of statistics is how accurately a statistic estimates a parameter. The accuracy really depends on how well the sample represents the population. The sample must contain the characteristics of the population in order to be a representative sample. We are interested in both the sample statistic and the population parameter in inferential statistics. In a later chapter, we will use the sample statistic to test the validity of the established population parameter.

A variable, usually notated by capital letters such as X and Y, is a characteristic or measurement that can be determined for each member of a population. Variables may be numerical or categorical. Numerical variables take on values with equal units such as weight in pounds and time in hours. Categorical variables place the person or thing into a category. If we let X equal the number of points earned by one math student at the end of a term, then X is a numerical variable. If we let Y be a person’s party affiliation, then some examples of Y include Republican, Democrat, and Independent. Y is a categorical variable. We could do some math with values of X (calculate the average number of points earned, for example), but it makes no sense to do math with values of Y (calculating an average party affiliation makes no sense).

Data are the actual values of the variable. They may be numbers or they may be words. Datum is a single value.

Two words that come up often in statistics are mean and proportion. If you were to take three exams in your math classes and obtain scores of 86, 75, and 92, you would calculate your mean score by adding the three exam scores and dividing by three (your mean score would be 84.3 to one decimal place). If, in your math class, there are 40 students and 22 are men and 18 are women, then the proportion of men students is 22/40 and the proportion of women students is 18/40. Mean and proportion are discussed in more detail in later chapters.
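The arithmetic behind the mean and the two proportions can be checked directly. This is a small Python sketch using the numbers from the paragraph above; the variable names are illustrative only.

```python
# Exam scores from the example
scores = [86, 75, 92]
mean_score = sum(scores) / len(scores)  # add the scores, divide by how many

# Class makeup from the example
men, women = 22, 18
class_size = men + women                # 40 students
proportion_men = men / class_size       # 22/40
proportion_women = women / class_size   # 18/40

print(round(mean_score, 1))
print(proportion_men, proportion_women)
```

Note that the two proportions add to 1 because every student falls into exactly one of the two groups.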

NOTE

The words “mean” and “average” are often used interchangeably. The substitution of one word for the other is common practice. The technical term is “arithmetic mean,” and “average” is technically a center location. However, in practice among non-statisticians, “average” is commonly accepted for “arithmetic mean.”

Example 1.1

Determine what the key terms refer to in the following study. We want to know the average (mean) amount of money first year college students spend at ABC College on school supplies that do not include books. We randomly surveyed 100 first year students at the college. Three of those students spent $150, $200, and $225, respectively.


Solution 1.1

The population is all first year students attending ABC College this term.

The sample could be all students enrolled in one section of a beginning statistics course at ABC College (although this sample may not represent the entire population).

The parameter is the average (mean) amount of money spent (excluding books) by first year college students at ABC College this term.

The statistic is the average (mean) amount of money spent (excluding books) by first year college students in the sample.

The variable could be the amount of money spent (excluding books) by one first year student. Let X = the amount of money spent (excluding books) by one first year student attending ABC College.

The data are the dollar amounts spent by the first year students. Examples of the data are $150, $200, and $225.

1.1 Determine what the key terms refer to in the following study. We want to know the average (mean) amount of money spent on school uniforms each year by families with children at Knoll Academy. We randomly survey 100 families with children in the school. Three of the families spent $65, $75, and $95, respectively.

Example 1.2

Determine what the key terms refer to in the following study.

A study was conducted at a local college to analyze the average cumulative GPA’s of students who graduated last year. Fill in the letter of the phrase that best describes each of the items below.

1. Population _____

2. Statistic _____

3. Parameter _____

4. Sample _____

5. Variable _____

6. Data _____

a) all students who attended the college last year

b) the cumulative GPA of one student who graduated from the college last year

c) 3.65, 2.80, 1.50, 3.90

d) a group of students who graduated from the college last year, randomly selected

e) the average cumulative GPA of students who graduated from the college last year

f) all students who graduated from the college last year

g) the average cumulative GPA of students in the study who graduated from the college last year

Solution 1.2 1. f; 2. g; 3. e; 4. d; 5. b; 6. c


Example 1.3

Determine what the key terms refer to in the following study.

As part of a study designed to test the safety of automobiles, the National Transportation Safety Board collected and reviewed data about the effects of an automobile crash on test dummies. Here is the criterion they used:

Speed at which Cars Crashed Location of “drive” (i.e. dummies)

35 miles/hour Front Seat

Table 1.1

Cars with dummies in the front seats were crashed into a wall at a speed of 35 miles per hour. We want to know the proportion of dummies in the driver’s seat that would have had head injuries, if they had been actual drivers. We start with a simple random sample of 75 cars.

Solution 1.3

The population is all cars containing dummies in the front seat.

The sample is the 75 cars, selected by a simple random sample.

The parameter is the proportion of driver dummies (if they had been real people) who would have suffered head injuries in the population.

The statistic is the proportion of driver dummies (if they had been real people) who would have suffered head injuries in the sample.

The variable X = the number of driver dummies (if they had been real people) who would have suffered head injuries.

The data are either: yes, had head injury, or no, did not.

Example 1.4

Determine what the key terms refer to in the following study.

An insurance company would like to determine the proportion of all medical doctors who have been involved in one or more malpractice lawsuits. The company selects 500 doctors at random from a professional directory and determines the number in the sample who have been involved in a malpractice lawsuit.

Solution 1.4

The population is all medical doctors listed in the professional directory.

The parameter is the proportion of medical doctors who have been involved in one or more malpractice suits in the population.

The sample is the 500 doctors selected at random from the professional directory.

The statistic is the proportion of medical doctors who have been involved in one or more malpractice suits in the sample.

The variable X = the number of medical doctors who have been involved in one or more malpractice suits.

The data are either: yes, was involved in one or more malpractice lawsuits, or no, was not.


Do the following exercise collaboratively with up to four people per group. Find a population, a sample, the parameter, the statistic, a variable, and data for the following study: You want to determine the average (mean) number of glasses of milk college students drink per day. Suppose yesterday, in your English class, you asked five students how many glasses of milk they drank the day before. The answers were 1, 0, 1, 3, and 4 glasses of milk.

1.2 | Data, Sampling, and Variation in Data and Sampling

Data may come from a population or from a sample. Lowercase letters like x or y generally are used to represent data values. Most data can be put into the following categories:

• Qualitative

• Quantitative

Qualitative data are the result of categorizing or describing attributes of a population. Qualitative data are also often called categorical data. Hair color, blood type, ethnic group, the car a person drives, and the street a person lives on are examples of qualitative data. Qualitative data are generally described by words or letters. For instance, hair color might be black, dark brown, light brown, blonde, gray, or red. Blood type might be AB+, O-, or B+. Researchers often prefer to use quantitative data over qualitative data because it lends itself more easily to mathematical analysis. For example, it does not make sense to find an average hair color or blood type.

Quantitative data are always numbers. Quantitative data are the result of counting or measuring attributes of a population. Amount of money, pulse rate, weight, number of people living in your town, and number of students who take statistics are examples of quantitative data. Quantitative data may be either discrete or continuous.

All data that are the result of counting are called quantitative discrete data. These data take on only certain numerical values. If you count the number of phone calls you receive for each day of the week, you might get values such as zero, one, two, or three.

Data that are not only made up of counting numbers, but that may include fractions, decimals, or irrational numbers, are called quantitative continuous data. Continuous data are often the results of measurements like lengths, weights, or times. A list of the lengths in minutes for all the phone calls that you make in a week, with numbers like 2.4, 7.5, or 11.0, would be quantitative continuous data.

Example 1.5 Data Sample of Quantitative Discrete Data

The data are the number of books students carry in their backpacks. You sample five students. Two students carry three books, one student carries four books, one student carries two books, and one student carries one book. The numbers of books (three, four, two, and one) are the quantitative discrete data.

1.5 The data are the number of machines in a gym. You sample five gyms. One gym has 12 machines, one gym has 15 machines, one gym has ten machines, one gym has 22 machines, and the other gym has 20 machines. What type of data is this?

Example 1.6 Data Sample of Quantitative Continuous Data

The data are the weights of backpacks with books in them. You sample the same five students. The weights (in pounds) of their backpacks are 6.2, 7, 6.8, 9.1, 4.3. Notice that backpacks carrying three books can have different weights. Weights are quantitative continuous data.

1.6 The data are the areas of lawns in square feet. You sample five houses. The areas of the lawns are 144 sq. feet, 160 sq. feet, 190 sq. feet, 180 sq. feet, and 210 sq. feet. What type of data is this?

Example 1.7

You go to the supermarket and purchase three cans of soup (19 ounces tomato bisque, 14.1 ounces lentil, and 19 ounces Italian wedding), two packages of nuts (walnuts and peanuts), four different kinds of vegetable (broccoli, cauliflower, spinach, and carrots), and two desserts (16 ounces pistachio ice cream and 32 ounces chocolate chip cookies).

Name data sets that are quantitative discrete, quantitative continuous, and qualitative.

Solution 1.7

One Possible Solution:

• The three cans of soup, two packages of nuts, four kinds of vegetables and two desserts are quantitative discrete data because you count them.

• The weights of the soups (19 ounces, 14.1 ounces, 19 ounces) are quantitative continuous data because you measure weights as precisely as possible.

• Types of soups, nuts, vegetables and desserts are qualitative data because they are categorical.

Try to identify additional data sets in this example.

Example 1.8

The data are the colors of backpacks. Again, you sample the same five students. One student has a red backpack, two students have black backpacks, one student has a green backpack, and one student has a gray backpack. The colors red, black, black, green, and gray are qualitative data.

1.8 The data are the colors of houses. You sample five houses. The colors of the houses are white, yellow, white, red, and white. What type of data is this?

NOTE

You may collect data as numbers and report it categorically. For example, the quiz scores for each student are recorded throughout the term. At the end of the term, the quiz scores are reported as A, B, C, D, or F.


Example 1.9

Work collaboratively to determine the correct data type (quantitative or qualitative). Indicate whether quantitative data are continuous or discrete. Hint: Data that are discrete often start with the words “the number of.”

a. the number of pairs of shoes you own

b. the type of car you drive

c. the distance it is from your home to the nearest grocery store

d. the number of classes you take per school year.

e. the type of calculator you use

f. weights of sumo wrestlers

g. number of correct answers on a quiz

h. IQ scores (This may cause some discussion.)

Solution 1.9 Items a, d, and g are quantitative discrete; items c, f, and h are quantitative continuous; items b and e are qualitative, or categorical.

1.9 Determine the correct data type (quantitative or qualitative) for the number of cars in a parking lot. Indicate whether quantitative data are continuous or discrete.

Example 1.10

A statistics professor collects information about the classification of her students as freshmen, sophomores, juniors, or seniors. The data she collects are summarized in the pie chart in Figure 1.3. What type of data does this graph show?

Figure 1.3


Solution 1.10 This pie chart shows the students in each year, which is qualitative (or categorical) data.

1.10 The registrar at State University keeps records of the number of credit hours students complete each semester. The data he collects are summarized in the histogram. The class boundaries are 10 to less than 13, 13 to less than 16, 16 to less than 19, 19 to less than 22, and 22 to less than 25.

Figure 1.4

What type of data does this graph show?

Qualitative Data Discussion

Below are tables comparing the number of part-time and full-time students at De Anza College and Foothill College enrolled for the spring 2010 quarter. The tables display counts (frequencies) and percentages or proportions (relative frequencies). The percent columns make comparing the same categories in the colleges easier. Displaying percentages along with the numbers is often helpful, but it is particularly important when comparing sets of data that do not have the same totals, such as the total enrollments for both colleges in this example. Notice how much larger the percentage for part-time students at Foothill College is compared to De Anza College.

De Anza College Foothill College

Number Percent Number Percent

Full-time 9,200 40.9% Full-time 4,059 28.6%

Part-time 13,296 59.1% Part-time 10,124 71.4%

Total 22,496 100% Total 14,183 100%

Table 1.2 Fall Term 2007 (Census day)
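The percent columns in Table 1.2 can be reproduced from the counts. A minimal Python sketch follows; the dictionary layout and the function name `relative_frequencies` are illustrative assumptions, not part of the original text.

```python
# Counts taken from Table 1.2
enrollment = {
    "De Anza College": {"Full-time": 9200, "Part-time": 13296},
    "Foothill College": {"Full-time": 4059, "Part-time": 10124},
}

def relative_frequencies(counts):
    """Convert category counts to percentages of the total, to one decimal."""
    total = sum(counts.values())
    return {category: round(100 * n / total, 1) for category, n in counts.items()}

percents = {college: relative_frequencies(counts)
            for college, counts in enrollment.items()}

for college, table in percents.items():
    print(college, table)
```

Because the two colleges have different totals, the percentages rather than the raw counts are what make the part-time comparison meaningful.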


Tables are a good way of organizing and displaying data. But graphs can be even more helpful in understanding the data. There are no strict rules concerning which graphs to use. Two graphs that are used to display qualitative data are pie charts and bar graphs.

In a pie chart, categories of data are represented by wedges in a circle and are proportional in size to the percent of individuals in each category.

In a bar graph, the length of the bar for each category is proportional to the number or percent of individuals in each category. Bars may be vertical or horizontal.

A Pareto chart consists of bars that are sorted into order by category size (largest to smallest).

Look at Figure 1.5 and Figure 1.6 and determine which graph (pie or bar) you think displays the comparisons better.

It is a good idea to look at a variety of graphs to see which is the most helpful in displaying the data. We might make different choices of what we think is the “best” graph depending on the data and the context. Our choice also depends on what we are using the data for.

(a) (b) Figure 1.5

Figure 1.6


Percentages That Add to More (or Less) Than 100%

Sometimes percentages add up to be more than 100% (or less than 100%). In the graph, the percentages add to more than 100% because students can be in more than one category. A bar graph is appropriate to compare the relative size of the categories. A pie chart cannot be used. It also could not be used if the percentages added to less than 100%.

Characteristic/Category Percent

Full-Time Students 40.9%

Students who intend to transfer to a 4-year educational institution 48.6%

Students under age 25 61.0%

TOTAL 150.5%

Table 1.3 De Anza College Spring 2010

Figure 1.7

Omitting Categories/Missing Data

The table displays Ethnicity of Students but is missing the “Other/Unknown” category. This category contains people who did not feel they fit into any of the ethnicity categories or declined to respond. Notice that the frequencies do not add up to the total number of students. In this situation, create a bar graph and not a pie chart.

Frequency Percent

Asian 8,794 36.1%

Black 1,412 5.8%

Filipino 1,298 5.3%

Hispanic 4,180 17.1%

Native American 146 0.6%

Pacific Islander 236 1.0%

White 5,978 24.5%

TOTAL 22,044 out of 24,382 90.4% out of 100%

Table 1.4 Ethnicity of Students at De Anza College Fall Term 2007 (Census Day)


Figure 1.8

The following graph is the same as the previous graph but the “Other/Unknown” percent (9.6%) has been included. The “Other/Unknown” category is large compared to some of the other categories (Native American, 0.6%, Pacific Islander 1.0%). This is important to know when we think about what the data are telling us.

This particular bar graph in Figure 1.9 can be difficult to understand visually. The graph in Figure 1.10 is a Pareto chart. The Pareto chart has the bars sorted from largest to smallest and is easier to read and interpret.

Figure 1.9 Bar Graph with Other/Unknown Category


Figure 1.10 Pareto Chart With Bars Sorted by Size
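The size of the missing “Other/Unknown” category and the Pareto ordering can both be worked out from the counts in Table 1.4. The following Python sketch is an illustration under that assumption; the variable names are not from the original text.

```python
# Frequencies from Table 1.4 (the "Other/Unknown" category is not listed there)
frequencies = {
    "Asian": 8794, "Black": 1412, "Filipino": 1298, "Hispanic": 4180,
    "Native American": 146, "Pacific Islander": 236, "White": 5978,
}
total_students = 24382

# The listed categories fall short of the total; the gap is "Other/Unknown".
listed = sum(frequencies.values())
other_unknown = total_students - listed

with_other = dict(frequencies)
with_other["Other/Unknown"] = other_unknown

percents = {name: round(100 * count / total_students, 1)
            for name, count in with_other.items()}

# Pareto order: categories sorted from largest to smallest.
pareto_order = sorted(with_other, key=with_other.get, reverse=True)

print(other_unknown, percents["Other/Unknown"])
print(pareto_order)
```

Sorting the categories this way is what turns an ordinary bar graph into a Pareto chart.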

Pie Charts: No Missing Data

The following pie charts have the “Other/Unknown” category included (since the percentages must add to 100%). The chart in Figure 1.11b is organized by the size of each wedge, which makes it a more visually informative graph than the unsorted, alphabetical graph in Figure 1.11a.

(a) (b)

Figure 1.11

Sampling

Gathering information about an entire population often costs too much or is virtually impossible. Instead, we use a sample of the population. A sample should have the same characteristics as the population it is representing. Most statisticians use various methods of random sampling in an attempt to achieve this goal. This section will describe a few of the most common methods. There are several different methods of random sampling. In each form of random sampling, each member of a population initially has an equal chance of being selected for the sample. Each method has pros and cons.

The easiest method to describe is called a simple random sample. Any group of n individuals is equally likely to be chosen as any other group of n individuals if the simple random sampling technique is used. In other words, each sample of the same size has an equal chance of being selected.

For example, suppose Lisa wants to form a four-person study group (herself and three other people) from her pre-calculus class, which has 31 members not including Lisa. To choose a simple random sample of size three from the other members of her class, Lisa could put all 31 names in a hat, shake the hat, close her eyes, and pick out three names. A more technological way is for Lisa to first list the last names of the members of her class together with a two-digit number, as in Table 1.5:


ID Name ID Name ID Name

00 Anselmo 11 King 21 Roquero

01 Bautista 12 Legeny 22 Roth

02 Bayani 13 Lundquist 23 Rowell

03 Cheng 14 Macierz 24 Salangsang

04 Cuarismo 15 Motogawa 25 Slade

05 Cuningham 16 Okimoto 26 Stratcher

06 Fontecha 17 Patel 27 Tallai

07 Hong 18 Price 28 Tran

08 Hoobler 19 Quizon 29 Wai

09 Jiao 20 Reyes 30 Wood

10 Khan

Table 1.5 Class Roster

Lisa can use a table of random numbers (found in many statistics books and mathematical handbooks), a calculator, or a computer to generate random numbers. For this example, suppose Lisa chooses to generate random numbers from a calculator. The numbers generated are as follows:

0.94360; 0.99832; 0.14669; 0.51470; 0.40581; 0.73381; 0.04399

Lisa reads two-digit groups until she has chosen three class members (that is, she reads 0.94360 as the groups 94, 43, 36, 60). Each random number may only contribute one class member. If she needed to, Lisa could have generated more random numbers.

The random numbers 0.94360 and 0.99832 do not contain appropriate two-digit numbers. However, the third random number, 0.14669, contains 14 (the fourth random number also contains 14), the fifth random number contains 05, and the seventh random number contains 04. The two-digit number 14 corresponds to Macierz, 05 corresponds to Cuningham, and 04 corresponds to Cuarismo. Besides herself, Lisa’s group will consist of Macierz, Cuningham, and Cuarismo.
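Lisa's procedure of reading two-digit groups can be traced step by step. The Python sketch below is one possible reading of the rules stated above (overlapping two-digit groups, IDs 00 to 30, at most one class member per random number); the function names are illustrative assumptions.

```python
# Class roster from Table 1.5; the list index is the two-digit ID.
roster = [
    "Anselmo", "Bautista", "Bayani", "Cheng", "Cuarismo", "Cuningham",
    "Fontecha", "Hong", "Hoobler", "Jiao", "Khan", "King", "Legeny",
    "Lundquist", "Macierz", "Motogawa", "Okimoto", "Patel", "Price",
    "Quizon", "Reyes", "Roquero", "Roth", "Rowell", "Salangsang", "Slade",
    "Stratcher", "Tallai", "Tran", "Wai", "Wood",
]

# Calculator output from the text, kept as strings to preserve the digits.
random_numbers = ["0.94360", "0.99832", "0.14669", "0.51470",
                  "0.40581", "0.73381", "0.04399"]

def two_digit_groups(number_text):
    """Return the overlapping two-digit groups after the decimal point."""
    digits = number_text.split(".")[1]
    return [digits[i:i + 2] for i in range(len(digits) - 1)]

def choose_members(numbers, names, how_many):
    """Pick class members by scanning each random number for a usable ID."""
    chosen_ids = []
    for number_text in numbers:
        for pair in two_digit_groups(number_text):
            member_id = int(pair)
            if member_id < len(names) and member_id not in chosen_ids:
                chosen_ids.append(member_id)
                break  # each random number contributes at most one member
        if len(chosen_ids) == how_many:
            break
    return [names[i] for i in chosen_ids]

study_group = choose_members(random_numbers, roster, 3)
print(study_group)
```

The fourth random number contributes nobody in this trace because its only usable ID, 14, has already been chosen.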

To generate random numbers:

• Press MATH.

• Arrow over to PRB.

• Press 5:randInt(. Enter 0, 30).

• Press ENTER for the first random number.

• Press ENTER two more times for the other 2 random numbers. If there is a repeat press ENTER again.

Note: randInt(0, 30, 3) will generate 3 random numbers.
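On a computer, a comparable draw can be made with a general-purpose language. This is a minimal Python sketch, offered as an analogue of the calculator steps rather than a description of how the calculator works; the function name `pick_ids` is hypothetical.

```python
import random

def pick_ids(low, high, count, seed=None):
    """Draw `count` distinct integers between low and high, inclusive."""
    rng = random.Random(seed)
    # sample() never repeats a value, so no re-draw is needed for duplicates.
    return rng.sample(range(low, high + 1), count)

ids = pick_ids(0, 30, 3)
print(ids)
```

Unlike pressing ENTER repeatedly, `sample` rules out repeats up front, which matches the instruction to draw again if a number comes up twice.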


Figure 1.12

Besides simple random sampling, there are other forms of sampling that involve a chance process for getting the sample. Other well-known random sampling methods are the stratified sample, the cluster sample, and the systematic sample.

To choose a stratified sample, divide the population into groups called strata and then take a proportionate number from each stratum. For example, you could stratify (group) your college population by department and then choose a proportionate simple random sample from each stratum (each department) to get a stratified random sample. To choose a simple random sample from each department, number each member of the first department, number each member of the second department, and do the same for the remaining departments. Then use simple random sampling to choose proportionate numbers from the first department and do the same for each of the remaining departments. Those numbers picked from the first department, picked from the second department, and so on represent the members who make up the stratified sample.
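A stratified sample along these lines might be sketched as follows. The department names, sizes, and the 10% sampling fraction are hypothetical values chosen for illustration, and `stratified_sample` is not a function from the text.

```python
import random

def stratified_sample(strata, fraction, seed=None):
    """Take a simple random sample of the same fraction from each stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, members in strata.items():
        size = round(len(members) * fraction)  # proportionate number
        sample[name] = rng.sample(members, size)
    return sample

# Hypothetical college: each department is a stratum of numbered members.
departments = {
    "Math": [f"M{i}" for i in range(40)],
    "English": [f"E{i}" for i in range(20)],
    "Biology": [f"B{i}" for i in range(60)],
}

stratified = stratified_sample(departments, fraction=0.10, seed=0)
for name, chosen in stratified.items():
    print(name, len(chosen), chosen)
```

Every department is represented, and larger departments contribute proportionately more members.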

To choose a cluster sample, divide the population into clusters (groups) and then randomly select some of the clusters. All the members from these clusters are in the cluster sample. For example, if you randomly sample four departments from your college population, the four departments make up the cluster sample. Divide your college faculty by department. The departments are the clusters. Number each department, and then choose four different numbers using simple random sampling. All members of the four departments with those numbers are the cluster sample.
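The cluster procedure can be sketched in the same style. The faculty lists below are made up for illustration, and `cluster_sample` is a hypothetical helper name.

```python
import random

def cluster_sample(clusters, num_clusters, seed=None):
    """Randomly choose whole clusters and keep every member of each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), num_clusters)
    members = [person for name in chosen for person in clusters[name]]
    return chosen, members

# Hypothetical faculty lists, keyed by department (the clusters).
faculty = {
    "Art": ["A1", "A2"],
    "Biology": ["B1", "B2", "B3"],
    "Chemistry": ["C1", "C2"],
    "English": ["E1", "E2", "E3", "E4"],
    "History": ["H1"],
    "Math": ["M1", "M2", "M3"],
}

chosen_departments, cluster_members = cluster_sample(faculty, 4, seed=0)
print(chosen_departments)
print(cluster_members)
```

The contrast with stratified sampling is that only some groups are selected, but everyone in a selected group is included.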

To choose a systematic sample, randomly select a starting point and take every nth piece of data from a listing of the population. For example, suppose you have to do a phone survey. Your phone book contains 20,000 residence listings. You must choose 400 names for the sample. Number the population 1–20,000 and then use a simple random sample to pick a number that represents the first name in the sample. Then choose every fiftieth name thereafter until you have a total of 400 names (you might have to go back to the beginning of your phone list). Systematic sampling is frequently chosen because it is a simple method.
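The phone-book example can be sketched as below. The step of 50 comes from dividing 20,000 listings by 400 names; the function name `systematic_sample` and the wrap-around arithmetic are illustrative assumptions about how "going back to the beginning" might be implemented.

```python
import random

def systematic_sample(population_size, sample_size, seed=None):
    """Pick a random start, then every k-th listing, wrapping past the end."""
    rng = random.Random(seed)
    step = population_size // sample_size     # 20,000 // 400 = 50
    start = rng.randint(1, population_size)   # random first listing (1-based)
    return [(start - 1 + i * step) % population_size + 1
            for i in range(sample_size)]

positions = systematic_sample(20000, 400, seed=0)
print(positions[:5])
```

Only the starting point is random; every later position is determined by the step.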

A type of sampling that is non-random is convenience sampling. Convenience sampling involves using results that are readily available. For example, a computer software store conducts a marketing study by interviewing potential customers who happen to be in the store browsing through the available software. The results of convenience sampling may be very good in some cases and highly biased (favor certain outcomes) in others.

Sampling data should be done very carefully. Collecting data carelessly can have devastating results. Surveys mailed to households and then returned may be very biased (they may favor a certain group). It is better for the person conducting the survey to select the sample respondents.

True random sampling is done with replacement. That is, once a member is picked, that member goes back into the population and thus may be chosen more than once. However, for practical reasons, in most populations, simple random sampling is done without replacement. Surveys are typically done without replacement. That is, a member of the population may be chosen only once. Most samples are taken from large populations and the sample tends to be small in comparison to the population. Since this is the case, sampling without replacement is approximately the same as sampling with replacement because the chance of picking the same individual more than once with replacement is very low.

In a college population of 10,000 people, suppose you want to pick a sample of 1,000 randomly for a survey. For any particular sample of 1,000, if you are sampling with replacement,

• the chance of picking the first person is 1,000 out of 10,000 (0.1000);

• the chance of picking a different second person for this sample is 999 out of 10,000 (0.0999);


• the chance of picking the same person again is 1 out of 10,000 (very low).

If you are sampling without replacement,

• the chance of picking the first person for any particular sample is 1,000 out of 10,000 (0.1000);

• the chance of picking a different second person is 999 out of 9,999 (0.0999);

• you do not replace the first person before picking the next person.

Compare the fractions 999/10,000 and 999/9,999. For accuracy, carry the decimal answers to four decimal places. To four decimal places, these numbers are equivalent (0.0999).

Sampling without replacement instead of sampling with replacement becomes a mathematical issue only when the population is small. For example, if the population is 25 people, the sample is ten, and you are sampling with replacement for any particular sample, then the chance of picking the first person is ten out of 25, and the chance of picking a different second person is nine out of 25 (you replace the first person).

If you sample without replacement, then the chance of picking the first person is ten out of 25, and then the chance of picking the second person (who is different) is nine out of 24 (you do not replace the first person).

Compare the fractions 9/25 and 9/24. To four decimal places, 9/25 = 0.3600 and 9/24 = 0.3750. To four decimal places, these numbers are not equivalent.
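Both comparisons, for the large population and the small one, can be checked with a few lines of arithmetic. This Python sketch uses the numbers from the text; the function name `second_pick_chances` is an illustrative assumption.

```python
def second_pick_chances(population, sample_size):
    """Chance of picking a different second person, with and without replacement."""
    remaining = sample_size - 1
    with_replacement = remaining / population            # first person is put back
    without_replacement = remaining / (population - 1)   # first person stays out
    return with_replacement, without_replacement

for population, sample_size in [(10000, 1000), (25, 10)]:
    with_r, without_r = second_pick_chances(population, sample_size)
    print(f"population {population}: {with_r:.4f} vs {without_r:.4f}")
```

To four decimal places the two chances agree for the population of 10,000 but differ for the population of 25.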

When you analyze data, it is important to be aware of sampling errors and nonsampling errors. The actual process of sampling causes sampling errors. For example, the sample may not be large enough. Factors not related to the sampling process cause nonsampling errors. A defective counting device can cause a nonsampling error.

In reality, a sample will never be exactly representative of the population so there will always be some sampling error. As a rule, the larger the sample, the smaller the sampling error.
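The rule of thumb above can be illustrated with a simulation. The sketch below is not from the original text: it invents a population of 10,000 hypothetical GPA-like values, then compares how far sample means typically fall from the population mean for small and large samples.

```python
import random
import statistics

def mean_abs_sampling_error(population, sample_size, trials, rng):
    """Average distance between the sample mean and the population mean."""
    true_mean = statistics.mean(population)
    errors = []
    for _ in range(trials):
        sample = rng.sample(population, sample_size)  # without replacement
        errors.append(abs(statistics.mean(sample) - true_mean))
    return statistics.mean(errors)

rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.uniform(0.0, 4.0) for _ in range(10000)]  # made-up values

small_sample_error = mean_abs_sampling_error(population, 10, 200, rng)
large_sample_error = mean_abs_sampling_error(population, 1000, 200, rng)
print(small_sample_error, large_sample_error)
```

The error does not vanish for the larger sample; it only tends to shrink, which is consistent with the statement that some sampling error always remains.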

In statistics, a sampling bias is created when a sample is collected from a population and some members of the population are not as likely to be chosen as others (remember, each member of the population should have an equally likely chance of being chosen). When a sampling bias happens, there can be incorrect conclusions drawn about the population that is being studied.

Critical Evaluation

We need to evaluate the statistical studies we read about critically and analyze them before accepting the results of the studies. Common problems to be aware of include

• Problems with samples: A sample must be representative of the population. A sample that is not representative of the population is biased. Biased samples that are not representative of the population give results that are inaccurate and not valid.

• Self-selected samples: Responses only by people who choose to respond, such as call-in surveys, are often unreliable.

• Sample size issues: Samples that are too small may be unreliable. Larger samples are better, if possible. In some situations, small samples are unavoidable and can still be used to draw conclusions. Examples: crash testing cars or medical testing for rare conditions.

• Undue influence: collecting data or asking questions in a way that influences the response

• Non-response or refusal of subject to participate: The collected responses may no longer be representative of the population. Often, people with strong positive or negative opinions may answer surveys, which can affect the results.

• Causality: A relationship between two variables does not mean that one causes the other to occur. They may be related (correlated) because of their relationship through a different variable.

• Self-funded or self-interest studies: A study performed by a person or organization in order to support their claim. Is the study impartial? Read the study carefully to evaluate the work. Do not automatically assume that the study is good, but do not automatically assume the study is bad either. Evaluate it on its merits and the work done.

• Misleading use of data: improperly displayed graphs, incomplete data, or lack of context

• Confounding: When the effects of multiple factors on a response cannot be separated. Confounding makes it difficult or impossible to draw valid conclusions about the effect of each factor.

COLLABORATIVE EXERCISE

As a class, determine whether or not the following samples are representative. If they are not, discuss the reasons.

This OpenStax book is available for free at http://cnx.org/content/col11562/1.18

1. To find the average GPA of all students in a university, use all honor students at the university as the sample.

2. To find out the most popular cereal among young people under the age of ten, stand outside a large supermarket for three hours and speak to every twentieth child under age ten who enters the supermarket.

3. To find the average annual income of all adults in the United States, sample U.S. congressmen. Create a cluster sample by considering each state as a stratum (group). By using simple random sampling, select states to be part of the cluster. Then survey every U.S. congressman in the cluster.

4. To determine the proportion of people taking public transportation to work, survey 20 people in New York City. Conduct the survey by sitting in Central Park on a bench and interviewing every person who sits next to you.

5. To determine the average cost of a two-day stay in a hospital in Massachusetts, survey 100 hospitals across the state using simple random sampling.

Example 1.11

A study is done to determine the average tuition that San Jose State undergraduate students pay per semester. Each student in the following samples is asked how much tuition he or she paid for the Fall semester. What is the type of sampling in each case?

a. A sample of 100 undergraduate San Jose State students is taken by organizing the students’ names by classification (freshman, sophomore, junior, or senior), and then selecting 25 students from each.

b. A random number generator is used to select a student from the alphabetical listing of all undergraduate students in the Fall semester. Starting with that student, every 50th student is chosen until 75 students are included in the sample.

c. A completely random method is used to select 75 students. Each undergraduate student in the fall semester has the same probability of being chosen at any stage of the sampling process.

d. The freshman, sophomore, junior, and senior years are numbered one, two, three, and four, respectively. A random number generator is used to pick two of those years. All students in those two years are in the sample.

e. An administrative assistant is asked to stand in front of the library one Wednesday and to ask the first 100 undergraduate students he encounters what they paid for tuition the Fall semester. Those 100 students are the sample.

Solution 1.11 a. stratified; b. systematic; c. simple random; d. cluster; e. convenience

1.11 You are going to use the random number generator to generate different types of samples from the data. This table displays six sets of quiz scores (each quiz counts 10 points) for an elementary statistics class.

#1 #2 #3 #4 #5 #6

5 7 10 9 8 3

10 5 9 8 7 6

9 10 8 6 7 9

9 10 10 9 8 9

7 8 9 5 7 4

9 9 9 10 8 7

7 7 10 9 8 8

8 8 9 10 8 8

9 7 8 7 7 8

8 8 10 9 8 7

Table 1.6

Instructions: Use the Random Number Generator to pick samples.

1. Create a stratified sample by column. Pick three quiz scores randomly from each column.

◦ Number each row one through ten.

◦ On your calculator, press Math and arrow over to PRB.

◦ For column 1, Press 5:randInt( and enter 1,10). Press ENTER. Record the number. Press ENTER 2 more times (even the repeats). Record these numbers. Record the three quiz scores in column one that correspond to these three numbers.

◦ Repeat for columns two through six.

◦ These 18 quiz scores are a stratified sample.

2. Create a cluster sample by picking two of the columns. Use the column numbers: one through six.

◦ Press MATH and arrow over to PRB.

◦ Press 5:randInt( and enter 1,6). Press ENTER. Record the number. Press ENTER and record that number.

◦ The two numbers are for two of the columns.

◦ The quiz scores (20 of them) in these 2 columns are the cluster sample.

3. Create a simple random sample of 15 quiz scores.

◦ Use the numbering one through 60.

◦ Press MATH. Arrow over to PRB. Press 5:randInt( and enter 1, 60).

◦ Press ENTER 15 times and record the numbers.

◦ Record the quiz scores that correspond to these numbers.

◦ These 15 quiz scores are the simple random sample.

4. Create a systematic sample of 12 quiz scores.

◦ Use the numbering one through 60.

◦ Press MATH. Arrow over to PRB. Press 5:randInt( and enter 1, 60).

◦ Press ENTER. Record the number and the first quiz score. From that number, count five quiz scores and record that quiz score. Keep counting five quiz scores and recording the quiz score until you have a sample of 12 quiz scores. You may wrap around (go back to the beginning).
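
The calculator steps above can also be sketched in Python. This is a rough equivalent, not part of the exercise. It assumes the 60 scores are numbered down each column in turn (the text does not fix an order), draws two distinct columns for the cluster sample, and takes the systematic step to be 60 / 12 = 5 so that twelve picks pass through the list exactly once. The seed is there only so the run can be repeated.

```python
import random

rng = random.Random(1)  # seeded only so this sketch is repeatable

# Table 1.6: one inner list per row; the six columns are quizzes #1 to #6.
rows = [
    [5, 7, 10, 9, 8, 3],
    [10, 5, 9, 8, 7, 6],
    [9, 10, 8, 6, 7, 9],
    [9, 10, 10, 9, 8, 9],
    [7, 8, 9, 5, 7, 4],
    [9, 9, 9, 10, 8, 7],
    [7, 7, 10, 9, 8, 8],
    [8, 8, 9, 10, 8, 8],
    [9, 7, 8, 7, 7, 8],
    [8, 8, 10, 9, 8, 7],
]
columns = [list(col) for col in zip(*rows)]       # six columns of ten scores
scores = [s for col in columns for s in col]      # numbered 1 to 60 (assumed order)

# 1. Stratified: like randInt(1,10) three times per column, repeats kept.
stratified = [col[rng.randint(1, 10) - 1] for col in columns for _ in range(3)]

# 2. Cluster: two distinct column numbers, then every score in those columns.
chosen_columns = rng.sample(range(1, 7), 2)
cluster = [s for c in chosen_columns for s in columns[c - 1]]

# 3. Simple random: like randInt(1,60) fifteen times.
simple_random = [scores[rng.randint(1, 60) - 1] for _ in range(15)]

# 4. Systematic: random starting number, then every fifth score, wrapping around.
start = rng.randint(1, 60)
step = len(scores) // 12                          # assumed step: 60 / 12 = 5
systematic = [scores[(start - 1 + i * step) % len(scores)] for i in range(12)]

for name, sample in [("stratified", stratified), ("cluster", cluster),
                     ("simple random", simple_random), ("systematic", systematic)]:
    print(f"{name} ({len(sample)} scores): {sample}")
```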

Example 1.12

Determine the type of sampling used (simple random, stratified, systematic, cluster, or convenience).

a. A soccer coach selects six players from a group of boys aged eight to ten, seven players from a group of boys aged 11 to 12, and three players from a group of boys aged 13 to 14 to form a recreational soccer team.

b. A pollster interviews all human resource personnel in five different high tech companies.

c. A high school educational researcher interviews 50 high school female teachers and 50 high school male teachers.

d. A medical researcher interviews every third cancer patient from a list of cancer patients at a local hospital.

e. A high school counselor uses a computer to generate 50 random numbers and then picks students whose names correspond to the numbers.

f. A student interviews classmates in his algebra class to determine how many pairs of jeans a student owns, on the average.

Solution 1.12 a. stratified; b. cluster; c. stratified; d. systematic; e. simple random; f. convenience

1.12 Determine the type of sampling used (simple random, stratified, systematic, cluster, or convenience). A high school principal polls 50 freshmen, 50 sophomores, 50 juniors, and 50 seniors regarding policy changes for after school activities.

If we were to examine two samples representing the same population, even if we used random sampling methods for the samples, they would not be exactly the same. Just as there is variation in data, there is variation in samples. As you become accustomed to sampling, the variability will begin to seem natural.

Example 1.13

Suppose ABC College has 10,000 part-time students (the population). We are interested in the average amount of money a part-time student spends on books in the fall term. Asking all 10,000 students is an almost impossible task.

Suppose we take two different samples.

First, we use convenience sampling and survey ten students from a first term organic chemistry class. Many of these students are taking first term calculus in addition to the organic chemistry class. The amount of money they spend on books is as follows:

$128; $87; $173; $116; $130; $204; $147; $189; $93; $153

The second sample is taken using a list of senior citizens who take P.E. classes and taking every fifth senior citizen on the list, for a total of ten senior citizens. They spend:

$50; $40; $36; $15; $50; $100; $40; $53; $22; $22

It is unlikely that any student is in both samples.

a. Do you think that either of these samples is representative of (or is characteristic of) the entire 10,000 part-time student population?

Solution 1.13 a. No. The first sample probably consists of science-oriented students. Besides the chemistry course, some of them are also taking first-term calculus. Books for these classes tend to be expensive. Most of these students are, more than likely, paying more than the average part-time student for their books. The second sample is a group of senior citizens who are, more than likely, taking courses for health and interest. The amount of money they spend on books is probably much less than the average part-time student. Both samples are biased. Also, in both cases, not all students have a chance to be in either sample.

b. Since these samples are not representative of the entire population, is it wise to use the results to describe the entire population?

Solution 1.13 b. No. For these samples, each member of the population did not have an equally likely chance of being chosen.

Now, suppose we take a third sample. We choose ten different part-time students from the disciplines of chemistry, math, English, psychology, sociology, history, nursing, physical education, art, and early childhood development. (We assume that these are the only disciplines in which part-time students at ABC College are enrolled and that an equal number of part-time students are enrolled in each of the disciplines.) Each student is chosen using simple random sampling. Using a calculator, random numbers are generated and a student from a particular discipline is selected if he or she has a corresponding number. The students spend the following amounts:

$180; $50; $150; $85; $260; $75; $180; $200; $200; $150

c. Is the sample biased?

Solution 1.13 c. The sample is unbiased, but a larger sample would be recommended to increase the likelihood that the sample will be close to representative of the population. However, for a biased sampling technique, even a large sample runs the risk of not being representative of the population.
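
To see how far apart the three samples are, their averages can be computed directly. The text does not report these averages; the sketch below only applies the ordinary mean to the amounts listed in the example, and the variable names are illustrative.

```python
from statistics import mean

# Book spending (in dollars) from the three samples in Example 1.13.
chemistry_class = [128, 87, 173, 116, 130, 204, 147, 189, 93, 153]
senior_citizens = [50, 40, 36, 15, 50, 100, 40, 53, 22, 22]
ten_disciplines = [180, 50, 150, 85, 260, 75, 180, 200, 200, 150]

samples = [
    ("Convenience sample (chemistry class)", chemistry_class),
    ("Senior citizens in P.E. classes", senior_citizens),
    ("One student from each discipline", ten_disciplines),
]
for label, sample in samples:
    print(f"{label}: mean = ${mean(sample):.2f}")
```

The first two averages differ by about one hundred dollars, which is consistent with the solution's point that the samples describe different groups of students rather than the whole part-time population.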

Students often ask if it is “good enough” to take a sample, instead of surveying the entire population. If the survey is done well, the answer is yes.

1.13 A local radio station has a fan base of 20,000 listeners. The station wants to know if its audience would prefer more music or more talk shows. Asking all 20,000 listeners is an almost impossible task.

The station uses convenience sampling and surveys the first 200 people they meet at one of the station’s music concert events. 24 people said they’d prefer more talk shows, and 176 people said they’d prefer more music.

Do you think that this sample is representative of (or is characteristic of) the entire 20,000 listener population?

Variation in Data

Variation is present in any set of data. For example, 16-ounce cans of beverage may contain more or less than 16 ounces of liquid. In one study, eight 16-ounce cans were measured and produced the following amounts (in ounces) of beverage:

15.8; 16.1; 15.2; 14.8; 15.8; 15.9; 16.0; 15.5

Measurements of the amount of beverage in a 16-ounce can may vary because different people make the measurements or because the exact amount, 16 ounces of liquid, was not put into the cans. Manufacturers regularly run tests to determine if the amount of beverage in a 16-ounce can falls within the desired range.

Be aware that as you take data, your data may vary somewhat from the data someone else is taking for the same purpose. This is completely natural. However, if two or more of you are taking the same data and get very different results, it is time for you and the others to reevaluate your data-taking methods and your accuracy.

Variation in Samples

It was mentioned previously that two or more samples from the same population, taken randomly, and having close to the same characteristics of the population will likely be different from each other. Suppose Doreen and Jung both decide to study the average amount of time students at their college sleep each night. Doreen and Jung each take samples of 500 students. Doreen uses systematic sampling and Jung uses cluster sampling. Doreen’s sample will be different from Jung’s sample. Even if Doreen and Jung used the same sampling method, in all likelihood their samples would be different. Neither would be wrong, however.

Think about what contributes to making Doreen’s and Jung’s samples different.

If Doreen and Jung took larger samples (i.e. the number of data values is increased), their sample results (the average amount of time a student sleeps) might be closer to the actual population average. But still, their samples would be, in all likelihood, different from each other. This variability in samples cannot be stressed enough.

Size of a Sample

The size of a sample (often called the number of observations) is important. The examples you have seen in this book so far have been small. Samples of only a few hundred observations, or even smaller, are sufficient for many purposes. In polling, samples that are from 1,200 to 1,500 observations are considered large enough and good enough if the survey is random and is well done. You will learn why when you study confidence intervals.

Be aware that many large samples are biased. For example, call-in surveys are invariably biased, because people choose to respond or not.

Divide into groups of two, three, or four. Your instructor will give each group one six-sided die. Try this experiment twice. Roll one fair die (six-sided) 20 times. Record the number of ones, twos, threes, fours, fives, and sixes you get in Table 1.7 and Table 1.8 (“frequency” is the number of times a particular face of the die occurs):

Face on Die Frequency

1

2

3

4

5

6

Table 1.7 First Experiment (20 rolls)

Face on Die Frequency

1

2

3

4

5

6

Table 1.8 Second Experiment (20 rolls)

Did the two experiments have the same results? Probably not. If you did the experiment a third time, do you expect the results to be identical to the first or second experiment? Why or why not?

Which experiment had the correct results? They both did. The job of the statistician is to see through the variability and draw appropriate conclusions.
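
If no die is at hand, the same experiment can be imitated on a computer. The following is a sketch that assumes Python's `random.randint(1, 6)` is an acceptable stand-in for a fair die; the function name is illustrative. Because the generator is not seeded, every run produces a new pair of tables.

```python
import random
from collections import Counter

def roll_experiment(rng, rolls=20):
    """Roll a fair six-sided die `rolls` times and tally each face."""
    tally = Counter(rng.randint(1, 6) for _ in range(rolls))
    return {face: tally.get(face, 0) for face in range(1, 7)}

rng = random.Random()  # unseeded, so each run gives a fresh experiment
first = roll_experiment(rng)
second = roll_experiment(rng)

print("Face  First  Second")
for face in range(1, 7):
    print(f"{face:>4}  {first[face]:>5}  {second[face]:>6}")
```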

1.3 | Frequency, Frequency Tables, and Levels of Measurement

Once you have a set of data, you will need to organize it so that you can analyze how frequently each datum occurs in the set. However, when calculating the frequency, you may need to round your answers so that they are as precise as possible.

Answers and Rounding Off

A simple way to round off answers is to carry your final answer one more decimal place than was present in the original data. Round off only the final answer. Do not round off any intermediate results, if possible. If it becomes necessary to round off intermediate results, carry them to at least twice as many decimal places as the final answer. For example, the average of the three quiz scores four, six, and nine is 6.3, rounded off to the nearest tenth, because the data are whole numbers. Most answers will be rounded off in this manner.
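
The quiz score example can be reproduced as follows. This is a minimal sketch with illustrative variable names; the point is that the intermediate average is left unrounded and only the final answer is rounded.

```python
from statistics import mean

quiz_scores = [4, 6, 9]            # whole numbers: zero decimal places
average = mean(quiz_scores)        # 19/3 = 6.333..., left unrounded
final_answer = round(average, 1)   # one more decimal place than the data
print(final_answer)                # 6.3
```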

It is not necessary to reduce most fractions in this course. Especially in Probability Topics, the chapter on probability, it is more helpful to leave an answer as an unreduced fraction.

Levels of Measurement

The way a set of data is measured is called its level of measurement. Correct statistical procedures depend on a researcher being familiar with levels of measurement. Not every statistical operation can be used with every set of data. Data can be classified into four levels of measurement. They are (from lowest to highest level):

• Nominal scale level

• Ordinal scale level

• Interval scale level

• Ratio scale level

Data that is measured using a nominal scale is qualitative (categorical). Categories, colors, names, labels and favorite foods along with yes or no responses are examples of nominal level data. Nominal scale data are not ordered. For example, trying to classify people according to their favorite food does not make any sense. Putting pizza first and sushi second is not meaningful.

Smartphone companies are another example of nominal scale data. The data are the names of the companies that make smartphones, but there is no agreed upon order of these brands, even though people may have personal preferences. Nominal scale data cannot be used in calculations.

Data that is measured using an ordinal scale is similar to nominal scale data but there is a big difference. The ordinal scale data can be ordered. An example of ordinal scale data is a list of the top five national parks in the United States. The top five national parks in the United States can be ranked from one to five but we cannot measure differences between the data.

Another example of using the ordinal scale is a cruise survey where the responses to questions about the cruise are “excellent,” “good,” “satisfactory,” and “unsatisfactory.” These responses are ordered from the most desired response to the least desired. But the differences between two pieces of data cannot be measured. Like the nominal scale data, ordinal scale data cannot be used in calculations.

Data that is measured using the interval scale is similar to ordinal level data because it has a definite ordering but there is a difference between data. The differences between interval scale data can be measured though the data does not have a starting point.

Temperature scales like Celsius (C) and Fahrenheit (F) are measured by using the interval scale. In both temperature measurements, 40° is equal to 100° minus 60°. Differences make sense. But 0 degrees does not because, in both scales, 0 is not the absolute lowest temperature. Temperatures like -10° F and -15° C exist and are colder than 0.

Interval level data can be used in calculations, but one type of comparison cannot be done. 80° C is not four times as hot as 20° C (nor is 80° F four times as hot as 20° F). There is no meaning to the ratio of 80 to 20 (or four to one).
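
One way to see why the ratio has no meaning is to express the same two temperatures on both scales. The sketch below uses the standard conversion F = 9C/5 + 32; the function and variable names are illustrative.

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius temperature to Fahrenheit."""
    return c * 9 / 5 + 32

hot_c, cool_c = 80, 20
hot_f = celsius_to_fahrenheit(hot_c)    # 176.0
cool_f = celsius_to_fahrenheit(cool_c)  # 68.0

# Differences are meaningful on either scale.
print(hot_c - cool_c)   # 60 Celsius degrees
print(hot_f - cool_f)   # 108.0 Fahrenheit degrees, the same physical gap

# Ratios change with the scale, so they carry no physical meaning.
print(hot_c / cool_c)            # 4.0
print(round(hot_f / cool_f, 2))  # 2.59
```

The same pair of temperatures gives a ratio of 4 in Celsius and about 2.59 in Fahrenheit, so neither number says how many times hotter one temperature is than the other.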

Data that is measured using the ratio scale takes care of the ratio problem and gives you the most information. Ratio scale data is like interval scale data, but it has a 0 point and ratios can be calculated. For example, four multiple choice statistics final exam scores are 80, 68, 20 and 92 (out of a possible 100 points). The exams are machine-graded.

The data can be put in order from lowest to highest: 20, 68, 80, 92.

The differences between the data have meaning. The score 92 is more than the score 68 by 24 points. Ratios can be calculated. The smallest score is 0. So 80 is four times 20. The score of 80 is four times better than the score of 20.

Frequency

Twenty students were asked how many hours they worked per day. Their responses, in hours, are as follows: 5; 6; 3; 3; 2; 4; 7; 5; 2; 3; 5; 6; 5; 4; 4; 3; 5; 2; 5; 3.

Table 1.9 lists the different data values in ascending order and their frequencies.

DATA VALUE | FREQUENCY
2 | 3
3 | 5
4 | 3
5 | 6
6 | 2
7 | 1

Table 1.9 Frequency Table of Student Work Hours

A frequency is the number of times a value of the data occurs. According to Table 1.9, there are three students who work two hours, five students who work three hours, and so on. The sum of the values in the frequency column, 20, represents the total number of students included in the sample.

A relative frequency is the ratio (fraction or proportion) of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes. To find the relative frequencies, divide each frequency by the total number of students in the sample, in this case 20. Relative frequencies can be written as fractions, percents, or decimals.

DATA VALUE | FREQUENCY | RELATIVE FREQUENCY
2 | 3 | 3/20 or 0.15
3 | 5 | 5/20 or 0.25
4 | 3 | 3/20 or 0.15
5 | 6 | 6/20 or 0.30
6 | 2 | 2/20 or 0.10
7 | 1 | 1/20 or 0.05

Table 1.10 Frequency Table of Student Work Hours with Relative Frequencies

The sum of the values in the relative frequency column of Table 1.10 is 20/20, or 1.

Cumulative relative frequency is the accumulation of the previous relative frequencies. To find the cumulative relative frequencies, add all the previous relative frequencies to the relative frequency for the current row, as shown in Table 1.11.

DATA VALUE | FREQUENCY | RELATIVE FREQUENCY | CUMULATIVE RELATIVE FREQUENCY
2 | 3 | 3/20 or 0.15 | 0.15
3 | 5 | 5/20 or 0.25 | 0.15 + 0.25 = 0.40
4 | 3 | 3/20 or 0.15 | 0.40 + 0.15 = 0.55
5 | 6 | 6/20 or 0.30 | 0.55 + 0.30 = 0.85
6 | 2 | 2/20 or 0.10 | 0.85 + 0.10 = 0.95
7 | 1 | 1/20 or 0.05 | 0.95 + 0.05 = 1.00

Table 1.11 Frequency Table of Student Work Hours with Relative and Cumulative Relative Frequencies

The last entry of the cumulative relative frequency column is one, indicating that one hundred percent of the data has been accumulated.
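
The three columns can be built from the raw responses in a few lines. The following is a sketch using Python's standard library; the variable names are illustrative and the printed layout is not meant to match the tables exactly.

```python
from collections import Counter
from itertools import accumulate

# Hours worked per day by the twenty students.
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

counts = Counter(hours)
values = sorted(counts)                       # data values in ascending order
frequencies = [counts[v] for v in values]     # how often each value occurs
total = sum(frequencies)                      # 20 students
relative = [f / total for f in frequencies]   # frequency / total
cumulative = list(accumulate(relative))       # running sum of relative frequencies

print("value  freq  rel.freq  cum.rel.freq")
for v, f, r, c in zip(values, frequencies, relative, cumulative):
    print(f"{v:>5}  {f:>4}  {r:>8.2f}  {c:>12.2f}")
```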

NOTE

Because of rounding, the relative frequency column may not always sum to one, and the last entry in the cumulative relative frequency column may not be one. However, they each should be close to one.
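
A tiny made-up case shows how this can happen. The frequencies below are hypothetical and do not come from any table in the text.

```python
# Hypothetical frequencies for three data values (not from the text).
frequencies = [1, 1, 1]
total = sum(frequencies)

# Each relative frequency is 1/3, which rounds to 0.33.
rounded = [round(f / total, 2) for f in frequencies]
print(rounded)                 # [0.33, 0.33, 0.33]
print(round(sum(rounded), 2))  # 0.99, close to one but not equal to it
```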

Table 1.12 represents the heights, in inches, of a sample of 100 male semiprofessional soccer players.

HEIGHTS (INCHES) | FREQUENCY | RELATIVE FREQUENCY | CUMULATIVE RELATIVE FREQUENCY
59.95–61.95 | 5 | 5/100 = 0.05 | 0.05
61.95–63.95 | 3 | 3/100 = 0.03 | 0.05 + 0.03 = 0.08
63.95–65.95 | 15 | 15/100 = 0.15 | 0.08 + 0.15 = 0.23
65.95–67.95 | 40 | 40/100 = 0.40 | 0.23 + 0.40 = 0.63
67.95–69.95 | 17 | 17/100 = 0.17 | 0.63 + 0.17 = 0.80
69.95–71.95 | 12 | 12/100 = 0.12 | 0.80 + 0.12 = 0.92
71.95–73.95 | 7 | 7/100 = 0.07 | 0.92 + 0.07 = 0.99
73.95–75.95 | 1 | 1/100 = 0.01 | 0.99 + 0.01 = 1.00
Total | 100 | 1.00 |

Table 1.12 Frequency Table of Soccer Player Height

The data in this table have been grouped into the following intervals:

• 59.95 to 61.95 inches

• 61.95 to 63.95 inches

• 63.95 to 65.95 inches

• 65.95 to 67.95 inches

• 67.95 to 69.95 inches

• 69.95 to 71.95 inches

• 71.95 to 73.95 inches

• 73.95 to 75.95 inches

NOTE

This example is used again in Descriptive Statistics, where the method used to compute the intervals will be explained.

In this sample, there are five players whose heights fall within the interval 59.95–61.95 inches, three players whose heights fall within the interval 61.95–63.95 inches, 15 players whose heights fall within the interval 63.95–65.95 inches, 40 players whose heights fall within the interval 65.95–67.95 inches, 17 players whose heights fall within the interval 67.95–69.95 inches, 12 players whose heights fall within the interval 69.95–71.95 inches, seven players whose heights fall within the interval 71.95–73.95 inches, and one player whose height falls within the interval 73.95–75.95 inches. All heights fall between the endpoints of an interval and not at the endpoints.

Example 1.14

From Table 1.12, find the percentage of heights that are less than 65.95 inches.

Solution 1.14 If you look at the first, second, and third rows, the heights are all less than 65.95 inches. There are 5 + 3 + 15 = 23 players whose heights are less than 65.95 inches. The percentage of heights less than 65.95 inches is then 23/100 or 23%. This percentage is the cumulative relative frequency entry in the third row.

1.14 Table 1.13 shows the amount, in inches, of annual rainfall in a sample of towns.

Rainfall (Inches) | Frequency | Relative Frequency | Cumulative Relative Frequency
2.95–4.97 | 6 | 6/50 = 0.12 | 0.12
4.97–6.99 | 7 | 7/50 = 0.14 | 0.12 + 0.14 = 0.26
6.99–9.01 | 15 | 15/50 = 0.30 | 0.26 + 0.30 = 0.56
9.01–11.03 | 8 | 8/50 = 0.16 | 0.56 + 0.16 = 0.72
11.03–13.05 | 9 | 9/50 = 0.18 | 0.72 + 0.18 = 0.90
13.05–15.07 | 5 | 5/50 = 0.10 | 0.90 + 0.10 = 1.00
Total | 50 | 1.00 |

Table 1.13

From Table 1.13, find the percentage of rainfall that is less than 9.01 inches.

Example 1.15

From Table 1.12, find the percentage of heights that fall between 61.95 and 65.95 inches.

Solution 1.15 Add the relative frequencies in the second and third rows: 0.03 + 0.15 = 0.18 or 18%.

1.15 From Table 1.13, find the percentage of rainfall that is between 6.99 and 13.05 inches.

Example 1.16

Use the heights of the 100 male semiprofessional soccer players in Table 1.12. Fill in the blanks and check your answers.

a. The percentage of heights that are from 67.95 to 71.95 inches is: ____.

b. The percentage of heights that are from 67.95 to 73.95 inches is: ____.

c. The percentage of heights that are more than 65.95 inches is: ____.

d. The number of players in the sample who are between 61.95 and 71.95 inches tall is: ____.

e. What kind of data are the heights?

f. Describe how you could gather this data (the heights) so that the data are characteristic of all male semiprofessional soccer players.

Remember, you count frequencies. To find the relative frequency, divide the frequency by the total number of data values. To find the cumulative relative frequency, add all of the previous relative frequencies to the relative frequency for the current row.

Solution 1.16 a. 29%

b. 36%

c. 77%

d. 87

e. quantitative continuous

f. get rosters from each team and choose a simple random sample from each
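
Parts a through d can be checked by adding frequencies over whole intervals of Table 1.12. The sketch below assumes every query lines up with interval endpoints, as it does in this example; the helper name `count_between` is illustrative.

```python
# Table 1.12: (lower bound, upper bound, frequency) for each height interval.
intervals = [
    (59.95, 61.95, 5), (61.95, 63.95, 3), (63.95, 65.95, 15), (65.95, 67.95, 40),
    (67.95, 69.95, 17), (69.95, 71.95, 12), (71.95, 73.95, 7), (73.95, 75.95, 1),
]
total = sum(freq for _, _, freq in intervals)  # 100 players

def count_between(low, high):
    """Number of players in the intervals lying entirely within [low, high]."""
    return sum(freq for lo, hi, freq in intervals if lo >= low and hi <= high)

a = 100 * count_between(67.95, 71.95) / total  # percentage from 67.95 to 71.95
b = 100 * count_between(67.95, 73.95) / total  # percentage from 67.95 to 73.95
c = 100 * count_between(65.95, 75.95) / total  # percentage above 65.95
d = count_between(61.95, 71.95)                # number of players, not a percentage

print(a, b, c, d)  # 29.0 36.0 77.0 87
```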

1.16 From Table 1.13, find the number of towns that have rainfall between 2.95 and 9.01 inches.

In your class, have someone conduct a survey of the number of siblings (brothers and sisters) each student has. Create a frequency table. Add to it a relative frequency column and a cumulative relative frequency column. Answer the following questions:

1. What percentage of the students in your class have no siblings?

2. What percentage of the students have from one to three siblings?

3. What percentage of the students have fewer than three siblings?

Example 1.17

Nineteen people were asked how many miles, to the nearest mile, they commute to work each day. The data are as follows: 2; 5; 7; 3; 2; 10; 18; 15; 20; 7; 10; 18; 5; 12; 13; 12; 4; 5; 10. Table 1.14 was produced:

DATA | FREQUENCY | RELATIVE FREQUENCY | CUMULATIVE RELATIVE FREQUENCY
3 | 3 | 3/19 | 0.1579
4 | 1 | 1/19 | 0.2105
5 | 3 | 3/19 | 0.1579
7 | 2 | 2/19 | 0.2632
10 | 3 | 4/19 | 0.4737
12 | 2 | 2/19 | 0.7895
13 | 1 | 1/19 | 0.8421
15 | 1 | 1/19 | 0.8948
18 | 1 | 1/19 | 0.9474
20 | 1 | 1/19 | 1.0000

Table 1.14 Frequency of Commuting Distances

a. Is the table correct? If it is not correct, what is wrong?

b. True or False: Three percent of the people surveyed commute three miles. If the statement is not correct, what should it be? If the table is incorrect, make the corrections.

c. What fraction of the people surveyed commute five or seven miles?

d. What fraction of the people surveyed commute 12 miles or more? Less than 12 miles? Between five and 13 miles (not including five and 13 miles)?

Solution 1.17 a. No. The frequency column sums to 18, not 19. Not all cumulative relative frequencies are correct.

b. False. The frequency for three miles should be one; for two miles (left out), two. The cumulative relative frequency column should read: 0.1053, 0.1579, 0.2105, 0.3684, 0.4737, 0.6316, 0.7368, 0.7895, 0.8421, 0.9474, 1.0000.

c. 5/19

d. 7/19, 12/19, 7/19
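
One way to check a table like this is to rebuild it from the raw data. The sketch below tallies the nineteen commuting distances and recomputes the fractions asked for in parts c and d; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Commuting distances (miles) reported by the nineteen people.
miles = [2, 5, 7, 3, 2, 10, 18, 15, 20, 7, 10, 18, 5, 12, 13, 12, 4, 5, 10]
counts = Counter(miles)
n = len(miles)  # 19 people

# Rebuild the table: value, frequency, relative and cumulative relative frequency.
running = 0
for value in sorted(counts):
    running += counts[value]
    print(value, counts[value], Fraction(counts[value], n), f"{running / n:.4f}")

# Fractions for parts c and d.
five_or_seven = Fraction(counts[5] + counts[7], n)
twelve_or_more = Fraction(sum(c for v, c in counts.items() if v >= 12), n)
less_than_twelve = 1 - twelve_or_more
between_5_and_13 = Fraction(sum(c for v, c in counts.items() if 5 < v < 13), n)

print(five_or_seven, twelve_or_more, less_than_twelve, between_5_and_13)
```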

1.17 Table 1.13 represents the amount, in inches, of annual rainfall in a sample of towns. What fraction of towns surveyed get between 11.03 and 13.05 inches of rainfall each year?

Example 1.18

Table 1.15 contains the total number of deaths worldwide as a result of earthquakes for the period from 2000 to 2012.

Year Total Number of Deaths

2000 231

2001 21,357

2002 11,685

2003 33,819

2004 228,802

2005 88,003

2006 6,605

2007 712

2008 88,011

2009 1,790

2010 320,120

2011 21,953

2012 768

Total 823,856

Table 1.15

Answer the following questions.

a. What is the frequency of deaths measured from 2006 through 2009?

b. What percentage of deaths occurred after 2009?

c. What is the relative frequency of deaths that occurred in 2003 or earlier?

d. What is the percentage of deaths that occurred in 2004?

e. What kind of data are the numbers of deaths?

f. The Richter scale is used to quantify the energy produced by an earthquake. Examples of Richter scale numbers are 2.3, 4.0, 6.1, and 7.0. What kind of data are these numbers?

Solution 1.18 a. 97,118 (11.8%)

b. 41.6%

c. 67,092/823,856 or 0.081 or 8.1%

d. 27.8%

e. Quantitative discrete

f. Quantitative continuous
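
The numerical answers to parts a through d can be reproduced from Table 1.15. The following is a sketch; the helper name `deaths_in` is illustrative.

```python
# Table 1.15: earthquake deaths worldwide by year.
deaths = {
    2000: 231, 2001: 21357, 2002: 11685, 2003: 33819, 2004: 228802,
    2005: 88003, 2006: 6605, 2007: 712, 2008: 88011, 2009: 1790,
    2010: 320120, 2011: 21953, 2012: 768,
}
total = sum(deaths.values())  # 823,856

def deaths_in(first_year, last_year):
    """Total deaths from first_year through last_year, inclusive."""
    return sum(n for year, n in deaths.items() if first_year <= year <= last_year)

a = deaths_in(2006, 2009)                # frequency for 2006 through 2009
b = 100 * deaths_in(2010, 2012) / total  # percentage after 2009
c = deaths_in(2000, 2003) / total        # relative frequency, 2003 or earlier
d = 100 * deaths[2004] / total           # percentage in 2004

print(a, round(b, 1), round(c, 3), round(d, 1))  # 97118 41.6 0.081 27.8
```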

1.18 Table 1.16 contains the total number of fatal motor vehicle traffic crashes in the United States for the period from 1994 to 2011.

Year Total Number of Crashes Year Total Number of Crashes

1994 36,254 2004 38,444

1995 37,241 2005 39,252

1996 37,494 2006 38,648

1997 37,324 2007 37,435

1998 37,107 2008 34,172

1999 37,140 2009 30,862

2000 37,526 2010 30,296

2001 37,862 2011 29,757

2002 38,491 Total 653,782

2003 38,477

Table 1.16

Answer the following questions.

a. What is the frequency of fatal crashes measured from 2000 through 2004?

b. What percentage of fatal crashes occurred after 2006?

c. What is the relative frequency of fatal crashes that occurred in 2000 or before?

d. What is the percentage of fatal crashes that occurred in 2011?

e. What is the cumulative relative frequency for 2006? Explain what this number tells you about the data.

1.4 | Experimental Design and Ethics

Does aspirin reduce the risk of heart attacks? Is one brand of fertilizer more effective at growing roses than another? Is fatigue as dangerous to a driver as the influence of alcohol? Questions like these are answered using randomized experiments. In this module, you will learn important aspects of experimental design. Proper study design ensures the production of reliable, accurate data.

The purpose of an experiment is to investigate the relationship between two variables. When one variable causes change in another, we call the first variable the explanatory variable. The affected variable is called the response variable. In a randomized experiment, the researcher manipulates values of the explanatory variable and measures the resulting changes in the response variable. The different values of the explanatory variable are called treatments. An experimental unit is a single object or individual to be measured.

You want to investigate the effectiveness of vitamin E in preventing disease. You recruit a group of subjects and ask them if they regularly take vitamin E. You notice that the subjects who take vitamin E exhibit better health on average than those who do not. Does this prove that vitamin E is effective in disease prevention? It does not. There are many differences between the two groups compared in addition to vitamin E consumption. People who take vitamin E regularly often take other steps to improve their health: exercise, diet, other vitamin supplements, choosing not to smoke. Any one of these factors could be influencing health. As described, this study does not prove that vitamin E is the key to disease prevention.

Additional variables that can cloud a study are called lurking variables. In order to prove that the explanatory variable is causing a change in the response variable, it is necessary to isolate the explanatory variable. The researcher must design her experiment in such a way that there is only one difference between groups being compared: the planned treatments. This is accomplished by the random assignment of experimental units to treatment groups. When subjects are assigned treatments randomly, all of the potential lurking variables are spread equally among the groups. At this point the only difference between groups is the one imposed by the researcher. Different outcomes measured in the response variable, therefore, must be a direct result of the different treatments. In this way, an experiment can prove a cause-and-effect connection between the explanatory and response variables.

The power of suggestion can have an important influence on the outcome of an experiment. Studies have shown that the expectation of the study participant can be as important as the actual medication. In one study of performance-enhancing drugs, researchers noted:

Results showed that believing one had taken the substance resulted in [performance] times almost as fast as those associated with consuming the drug itself. In contrast, taking the drug without knowledge yielded no significant performance increment.[1]

When participation in a study prompts a physical response from a participant, it is difficult to isolate the effects of the explanatory variable. To counter the power of suggestion, researchers set aside one treatment group as a control group. This group is given a placebo treatment, that is, a treatment that cannot influence the response variable. The control group helps researchers balance the effects of being in an experiment with the effects of the active treatments. Of course, if you are participating in a study and you know that you are receiving a pill which contains no actual medication, then the power of suggestion is no longer a factor. Blinding in a randomized experiment preserves the power of suggestion. When a person involved in a research study is blinded, he does not know who is receiving the active treatment(s) and who is receiving the placebo treatment. A double-blind experiment is one in which both the subjects and the researchers involved with the subjects are blinded.

Example 1.19

Researchers want to investigate whether taking aspirin regularly reduces the risk of heart attack. Four hundred men between the ages of 50 and 84 are recruited as participants. The men are divided randomly into two groups: one group will take aspirin, and the other group will take a placebo. Each man takes one pill each day for three years, but he does not know whether he is taking aspirin or the placebo. At the end of the study, researchers count the number of men in each group who have had heart attacks.

Identify the following values for this study: population, sample, experimental units, explanatory variable, response variable, treatments.

Solution 1.19 The population is men aged 50 to 84. The sample is the 400 men who participated. The experimental units are the individual men in the study. The explanatory variable is oral medication. The treatments are aspirin and a placebo. The response variable is whether a subject had a heart attack.
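The random division of the 400 men into an aspirin group and a placebo group could be carried out in many ways. The sketch below shows one of them: shuffle the participants, then deal them out to the treatments in turn. The ID numbers, the function name `randomly_assign`, and the seed value are assumptions made for illustration; the text does not say how the researchers actually randomized.

```python
import random


def randomly_assign(units, treatments, seed=None):
    """Shuffle the units, then deal them out to the treatments in turn."""
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {treatment: [] for treatment in treatments}
    for position, unit in enumerate(shuffled):
        treatment = treatments[position % len(treatments)]
        groups[treatment].append(unit)
    return groups


# Hypothetical ID numbers 1..400 stand in for the 400 men in the study.
men = list(range(1, 401))
groups = randomly_assign(men, ["aspirin", "placebo"], seed=2013)

print(len(groups["aspirin"]), len(groups["placebo"]))  # 200 200
```

Because chance alone decides each man's group, lurking variables such as diet or exercise habits tend to be spread across both groups rather than concentrated in one.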

1. McClung, M. Collins, D. “Because I know it will!”: placebo effects of an ergogenic aid on athletic performance. Journal of Sport & Exercise Psychology. 2007 Jun. 29(3):382-94. Web. April 30, 2013.

Example 1.20

The Smell & Taste Treatment and Research Foundation conducted a study to investigate whether smell can affect learning. Subjects completed mazes multiple times while wearing masks. They completed the pencil and paper mazes three times wearing floral-scented masks, and three times with unscented masks. Participants were assigned at random to wear the floral mask during the first three trials or during the last three trials. For each trial, researchers recorded the time it took to complete the maze and the subject’s impression of the mask’s scent: positive, negative, or neutral.

a. Describe the explanatory and response variables in this study.

b. What are the treatments?

c. Identify any lurking variables that could interfere with this study.

d. Is it possible to use blinding in this study?

Solution 1.20 a. The explanatory variable is scent, and the response variable is the time it takes to complete the maze.

b. There are two treatments: a floral-scented mask and an unscented mask.

c. All subjects experienced both treatments. The order of treatments was randomly assigned so there were no differences between the treatment groups. Random assignment eliminates the problem of lurking variables.

d. Subjects will clearly know whether they can smell flowers or not, so subjects cannot be blinded in this study. Researchers timing the mazes can be blinded, though. The researcher who is observing a subject will not know which mask is being worn.
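In this design every subject receives both treatments, so what is randomized is the order of the treatments. A minimal sketch of that idea follows; the number of subjects, the function name `assign_mask_order`, and the seed are hypothetical, since the text does not report them.

```python
import random


def assign_mask_order(subject_ids, seed=None):
    """Randomly decide, per subject, which mask is worn in the first three trials."""
    rng = random.Random(seed)
    orders = {}
    for subject in subject_ids:
        if rng.random() < 0.5:
            orders[subject] = ["floral"] * 3 + ["unscented"] * 3
        else:
            orders[subject] = ["unscented"] * 3 + ["floral"] * 3
    return orders


# Ten hypothetical subjects, numbered 1..10.
orders = assign_mask_order(range(1, 11), seed=7)
for subject, order in orders.items():
    print(subject, order)
```

Randomizing the order guards against effects such as practice: if every subject wore the floral mask last, faster times could simply reflect familiarity with the mazes.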

Example 1.21

A researcher wants to study the effects of birth order on personality. Explain why this study could not be conducted as a randomized experiment. What is the main problem in a study that cannot be designed as a randomized experiment?

Solution 1.21 The explanatory variable is birth order. You cannot randomly assign a person’s birth order. Random assignment eliminates the impact of lurking variables. When you cannot assign subjects to treatment groups at random, there will be differences between the groups other than the explanatory variable.

1.21 You are concerned about the effects of texting on driving performance. Design a study to test the response time of drivers while texting and while driving only. How many seconds does it take for a driver to respond when a leading car hits the brakes?

a. Describe the explanatory and response variables in the study.

b. What are the treatments?

c. What should you consider when selecting participants?

d. Your research partner wants to divide participants randomly into two groups: one to drive without distraction and one to text and drive simultaneously. Is this a good idea? Why or why not?

e. Identify any lurking variables that could interfere with this study.

f. How can blinding be used in this study?

Ethics The widespread misuse and misrepresentation of statistical information often gives the field a bad name. Some say that “numbers don’t lie,” but the people who use numbers to support their claims often do.

A recent investigation of the famous social psychologist Diederik Stapel has led to the retraction of his articles from some of the world’s top journals, including Journal of Experimental Social Psychology, Social Psychology, Basic and Applied Social Psychology, British Journal of Social Psychology, and the magazine Science. Diederik Stapel is a former professor at Tilburg University in the Netherlands. Over the past two years, an extensive investigation involving three universities where Stapel has worked concluded that the psychologist is guilty of fraud on a colossal scale. Falsified data taints over 55 papers he authored and 10 Ph.D. dissertations that he supervised.

Stapel did not deny that his deceit was driven by ambition. But it was more complicated than that, he told me. He insisted that he loved social psychology but had been frustrated by the messiness of experimental data, which rarely led to clear conclusions. His lifelong obsession with elegance and order, he said, led him to concoct sexy results that journals found attractive. “It was a quest for aesthetics, for beauty—instead of the truth,” he said. He described his behavior as an addiction that drove him to carry out acts of increasingly daring fraud, like a junkie seeking a bigger and better high.[2]

The committee investigating Stapel concluded that he is guilty of several practices including:

• creating datasets, which largely confirmed the prior expectations,

• altering data in existing datasets,

• changing measuring instruments without reporting the change, and

• misrepresenting the number of experimental subjects.

Clearly, it is never acceptable to falsify data the way this researcher did. Sometimes, however, violations of ethics are not as easy to spot.

Researchers have a responsibility to verify that proper methods are being followed. The report describing the investigation of Stapel’s fraud states that, “statistical flaws frequently revealed a lack of familiarity with elementary statistics.”[3] Many of Stapel’s co-authors should have spotted irregularities in his data. Unfortunately, they did not know very much about statistical analysis, and they simply trusted that he was collecting and reporting data properly.

Many types of statistical fraud are difficult to spot. Some researchers simply stop collecting data once they have just enough to prove what they had hoped to prove. They don’t want to take the chance that a more extensive study would complicate their lives by producing data contradicting their hypothesis.

Professional organizations, like the American Statistical Association, clearly define expectations for researchers. There are even laws in the federal code about the use of research data.

When a statistical study uses human participants, as in medical studies, both ethics and the law dictate that researchers should be mindful of the safety of their research subjects. The U.S. Department of Health and Human Services oversees federal regulations of research studies with the aim of protecting participants. When a university or other research institution engages in research, it must ensure the safety of all human subjects. For this reason, research institutions establish oversight committees known as Institutional Review Boards (IRB). All planned studies must be approved in advance by the IRB. Key protections that are mandated by law include the following:

• Risks to participants must be minimized and reasonable with respect to projected benefits.

• Participants must give informed consent. This means that the risks of participation must be clearly explained to the subjects of the study. Subjects must consent in writing, and researchers are required to keep documentation of their consent.

• Data collected from individuals must be guarded carefully to protect their privacy.

These ideas may seem fundamental, but they can be very difficult to verify in practice. Is removing a participant’s name from the data record sufficient to protect privacy? Perhaps the person’s identity could be discovered from the data that remains. What happens if the study does not proceed as planned and risks arise that were not anticipated? When is informed consent really necessary? Suppose your doctor wants a blood sample to check your cholesterol level. Once the sample has been tested, you expect the lab to dispose of the remaining blood. At that point the blood becomes biological waste. Does a researcher have the right to take it for use in a study?

It is important that students of statistics take time to consider the ethical questions that arise in statistical studies. How prevalent is fraud in statistical studies? You might be surprised, and disappointed. There is a website (http://www.retractionwatch.com) dedicated to cataloging retractions of study articles that have been proven fraudulent. A quick glance will show that the misuse of statistics is a bigger problem than most people realize.

Vigilance against fraud requires knowledge. Learning the basic theory of statistics will empower you to analyze statistical studies critically.

2. Yudhijit Bhattacharjee, “The Mind of a Con Man,” Magazine, New York Times, April 26, 2013. Available online at: http://www.nytimes.com/2013/04/28/magazine/diederik-stapels-audacious-academic-fraud.html?src=dayp&_r=2& (accessed May 1, 2013).

3. “Flawed Science: The Fraudulent Research Practices of Social Psychologist Diederik Stapel,” Tilburg University, November 28, 2012, http://www.tilburguniversity.edu/upload/064a10cd-bce5-4385-b9ff-05b840caeae6_120695_Rapp_nov_2012_UK_web.pdf (accessed May 1, 2013).

Example 1.22

Describe the unethical behavior in each example and describe how it could impact the reliability of the resulting data. Explain how the problem should be corrected.

A researcher is collecting data in a community.

a. She selects a block where she is comfortable walking because she knows many of the people living on the street.

b. No one seems to be home at four houses on her route. She does not record the addresses and does not return at a later time to try to find residents at home.

c. She skips four houses on her route because she is running late for an appointment. When she gets home, she fills in the forms by selecting random answers from other residents in the neighborhood.

Solution 1.22 a. By selecting a convenience sample, the researcher is intentionally selecting a sample that could be biased. Claiming that this sample represents the community is misleading. The researcher needs to select areas in the community at random.

b. Intentionally omitting relevant data will create bias in the sample. Suppose the researcher is gathering information about jobs and child care. By ignoring people who are not home, she may be missing data from working families that are relevant to her study. She needs to make every effort to interview all members of the target sample.

c. It is never acceptable to fake data. Even though the responses she uses are “real” responses provided by other participants, the duplication is fraudulent and can create bias in the data. She needs to work diligently to interview everyone on her route.

1.22 Describe the unethical behavior, if any, in each example and describe how it could impact the reliability of the resulting data. Explain how the problem should be corrected.

A study is commissioned to determine the favorite brand of fruit juice among teens in California.

a. The survey is commissioned by the seller of a popular brand of apple juice.

b. There are only two types of juice included in the study: apple juice and cranberry juice.

c. Researchers allow participants to see the brand of juice as samples are poured for a taste test.

d. Twenty-five percent of participants prefer Brand X, 33% prefer Brand Y and 42% have no preference between the two brands. Brand X references the study in a commercial saying “Most teens like Brand X as much as or more than Brand Y.”

1.5 | Data Collection Experiment


1.1 Data Collection Experiment

Class Time:

Names:

Student Learning Outcomes

• The student will demonstrate the systematic sampling technique.

• The student will construct relative frequency tables.

• The student will interpret results and their differences from different data groupings.

Movie Survey Ask five classmates from a different class how many movies they saw at the theater last month. Do not include rented movies.

1. Record the data.

2. In class, randomly pick one person. On the class list, mark that person’s name. Move down four names on the class list. Mark that person’s name. Continue doing this until you have marked 12 names. You may need to go back to the start of the list. For each marked name record the five data values. You now have a total of 60 data values.

3. For each name marked, record the data.

___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___

___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___

___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___

___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___

___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___

Table 1.17

Order the Data Complete the two relative frequency tables below using your class data.

Number of Movies Frequency Relative Frequency Cumulative Relative Frequency

0

1

2

3

4

5

6


7+

Table 1.18 Frequency of Number of Movies Viewed

Number of Movies Frequency Relative Frequency Cumulative Relative Frequency

0–1

2–3

4–5

6–7+

Table 1.19 Frequency of Number of Movies Viewed
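The two tables above group the same data differently. As a rough illustration, the sketch below builds both kinds of table from a small, made-up list of 20 movie counts (your class data will differ). The function name `frequency_table` and the bin format are assumptions made for this example.

```python
def frequency_table(data, bins):
    """Build rows of (label, frequency, relative freq., cumulative relative freq.).

    Each bin is (label, low, high) with inclusive bounds; high=None means
    the bin has no upper limit (as in "7+").
    """
    n = len(data)
    rows = []
    cumulative = 0
    for label, low, high in bins:
        freq = sum(1 for x in data if x >= low and (high is None or x <= high))
        cumulative += freq
        rows.append((label, freq, freq / n, cumulative / n))
    return rows


# Hypothetical data: number of movies seen last month by 20 students.
movies = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5, 0, 1, 2, 3, 7, 6, 1, 0, 2, 4]

single_bins = [(str(k), k, k) for k in range(7)] + [("7+", 7, None)]
grouped_bins = [("0-1", 0, 1), ("2-3", 2, 3), ("4-5", 4, 5), ("6-7+", 6, None)]

single_rows = frequency_table(movies, single_bins)
grouped_rows = frequency_table(movies, grouped_bins)

for row in single_rows:
    print(row)
for row in grouped_rows:
    print(row)
```

With this made-up data, the cumulative relative frequency in the row labeled 2 of the first table gives the proportion of values that are at most two. The grouped table has no row that ends at two, so it can report the proportion that is at most three but not the proportion that is at most two.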

1. Using the tables, find the percent of data that is at most two. Which table did you use and why?

2. Using the tables, find the percent of data that is at most three. Which table did you use and why?

3. Using the tables, find the percent of data that is more than two. Which table did you use and why?

4. Using the tables, find the percent of data that is more than three. Which table did you use and why?

Discussion Questions 1. Is one of the tables “more correct” than the other? Why or why not?

2. In general, how could you group the data differently? Are there any advantages to either way of grouping the data?

3. Why did you switch between tables, if you did, when answering the questions above?

1.6 | Sampling Experiment


1.2 Sampling Experiment

Class Time:

Names:

Student Learning Outcomes

• The student will demonstrate the simple random, systematic, stratified, and cluster sampling techniques.

• The student will explain the details of each procedure used.

In this lab, you will be asked to pick several random samples of restaurants. In each case, describe your procedure briefly, including how you might have used the random number generator, and then list the restaurants in the sample you obtained.

NOTE

The following section contains restaurants stratified by city into columns and grouped horizontally by entree cost (clusters).

Restaurants Stratified by City and Entree Cost

City / Entree Cost | Under $10 | $10 to under $15 | $15 to under $20 | Over $20

San Jose | El Abuelo Taq, Pasta Mia, Emma’s Express, Bamboo Hut | Emperor’s Guard, Creekside Inn | Agenda, Gervais, Miro’s | Blake’s, Eulipia, Hayes Mansion, Germania

Palo Alto | Senor Taco, Olive Garden, Taxi’s | Ming’s, P.A. Joe’s, Stickney’s | Scott’s Seafood, Poolside Grill, Fish Market | Sundance Mine, Maddalena’s, Spago’s

Los Gatos | Mary’s Patio, Mount Everest, Sweet Pea’s, Andele Taqueria | Lindsey’s, Willow Street | Toll House | Charter House, La Maison Du Cafe

Mountain View | Maharaja, New Ma’s, Thai-Rific, Garden Fresh | Amber Indian, La Fiesta, Fiesta del Mar, Dawit | Austin’s, Shiva’s, Mazeh | Le Petit Bistro

Cupertino | Hobees, Hung Fu, Samrat, Panda Express | Santa Barb. Grill, Mand. Gourmet, Bombay Oven, Kathmandu West | Fontana’s, Blue Pheasant | Hamasushi, Helios

Sunnyvale | Chekijababi, Taj India, Full Throttle, Tia Juana, Lemon Grass | Pacific Fresh, Charley Brown’s, Cafe Cameroon, Faz, Aruba’s | Lion & Compass, The Palace, Beau Sejour | (none)

Santa Clara | Rangoli, Armadillo Willy’s, Thai Pepper, Pasand | Arthur’s, Katie’s Cafe, Pedro’s, La Galleria | Birk’s, Truya Sushi, Valley Plaza | Lakeside, Mariani’s

Table 1.20 Restaurants Used in Sample


A Simple Random Sample Pick a simple random sample of 15 restaurants.

1. Describe your procedure.

2. Complete the table with your sample.

1. __________ 6. __________ 11. __________

2. __________ 7. __________ 12. __________

3. __________ 8. __________ 13. __________

4. __________ 9. __________ 14. __________

5. __________ 10. __________ 15. __________

Table 1.21
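One possible procedure, sketched here under stated assumptions: number the 84 restaurants in Table 1.20 from 1 to 84 in reading order, then let a random number generator pick 15 distinct labels. The numbering scheme and the seed are illustrative choices, not part of the lab instructions.

```python
import random

POPULATION_SIZE = 84  # number of restaurants listed in Table 1.20
SAMPLE_SIZE = 15

rng = random.Random(42)  # arbitrary seed so the draw can be repeated
labels = range(1, POPULATION_SIZE + 1)

# random.sample draws without replacement, so no label appears twice.
sample = sorted(rng.sample(labels, SAMPLE_SIZE))
print(sample)
```

Each label in the output would then be matched back to the restaurant that carries that number.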

A Systematic Sample Pick a systematic sample of 15 restaurants.

1. Describe your procedure.

2. Complete the table with your sample.

1. __________ 6. __________ 11. __________

2. __________ 7. __________ 12. __________

3. __________ 8. __________ 13. __________

4. __________ 9. __________ 14. __________

5. __________ 10. __________ 15. __________

Table 1.22
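A systematic sample could be drawn along the following lines. This sketch again assumes the restaurants are numbered 1 to 84, and it assumes that the step size k is the population size divided by the sample size, rounded down (84 / 15 gives k = 5). The function name `systematic_sample` is hypothetical.

```python
import random


def systematic_sample(population, sample_size, seed=None):
    """Pick a random starting point, then take every k-th member, wrapping around."""
    rng = random.Random(seed)
    n = len(population)
    k = n // sample_size          # step size, rounded down
    start = rng.randrange(n)      # random starting position in the list
    return [population[(start + i * k) % n] for i in range(sample_size)]


restaurants = list(range(1, 85))  # labels 1..84 for the restaurants in Table 1.20
sample = systematic_sample(restaurants, 15, seed=42)
print(sample)
```

The modulo operation returns to the beginning of the list when the end is reached, which mirrors the instruction to go back to the start of the list if necessary.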

A Stratified Sample Pick a stratified sample, by city, of 20 restaurants. Use 25% of the restaurants from each stratum. Round to the nearest whole number.

1. Describe your procedure.

2. Complete the table with your sample.

1. __________ 6. __________ 11. __________ 16. __________

2. __________ 7. __________ 12. __________ 17. __________

3. __________ 8. __________ 13. __________ 18. __________

4. __________ 9. __________ 14. __________ 19. __________

5. __________ 10. __________ 15. __________ 20. __________

Table 1.23
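For the sample stratified by city, one approach is to take a simple random sample of 25% of the restaurants within each city. The sketch below uses the city counts from Table 1.20 and numbers the restaurants within each city starting at 1; the numbering, the function name `stratified_sample`, and the seed are assumptions made for illustration.

```python
import random

# Number of restaurants listed for each city in Table 1.20.
CITY_SIZES = {
    "San Jose": 13, "Palo Alto": 12, "Los Gatos": 9, "Mountain View": 12,
    "Cupertino": 12, "Sunnyvale": 13, "Santa Clara": 13,
}


def stratified_sample(strata_sizes, fraction, seed=None):
    """Draw a simple random sample of the given fraction from every stratum."""
    rng = random.Random(seed)
    sample = {}
    for stratum, size in strata_sizes.items():
        take = round(size * fraction)  # e.g. 13 * 0.25 = 3.25 rounds to 3
        sample[stratum] = sorted(rng.sample(range(1, size + 1), take))
    return sample


sample = stratified_sample(CITY_SIZES, 0.25, seed=42)
for city, chosen in sample.items():
    print(city, chosen)
print(sum(len(chosen) for chosen in sample.values()))  # 20
```

Sampling within every city guarantees that each city is represented, which a simple random sample of the whole list does not.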


A Stratified Sample Pick a stratified sample, by entree cost, of 21 restaurants. Use 25% of the restaurants from each stratum. Round to the nearest whole number.

1. Describe your procedure.

2. Complete the table with your sample.

1. __________ 6. __________ 11. __________ 16. __________

2. __________ 7. __________ 12. __________ 17. __________

3. __________ 8. __________ 13. __________ 18. __________

4. __________ 9. __________ 14. __________ 19. __________

5. __________ 10. __________ 15. __________ 20. __________

21. __________

Table 1.24

A Cluster Sample Pick a cluster sample of restaurants from two cities. The number of restaurants will vary.

1. Describe your procedure.

2. Complete the table with your sample.

1. ________ 6. ________ 11. ________ 16. ________ 21. ________

2. ________ 7. ________ 12. ________ 17. ________ 22. ________

3. ________ 8. ________ 13. ________ 18. ________ 23. ________

4. ________ 9. ________ 14. ________ 19. ________ 24. ________

5. ________ 10. ________ 15. ________ 20. ________ 25. ________

Table 1.25
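A cluster sample treats each city as a cluster: choose two cities at random, then include every restaurant in those cities. The following sketch uses the city counts from Table 1.20; the within-city numbering, the function name `cluster_sample`, and the seed are hypothetical.

```python
import random

# Number of restaurants listed for each city in Table 1.20.
CITY_SIZES = {
    "San Jose": 13, "Palo Alto": 12, "Los Gatos": 9, "Mountain View": 12,
    "Cupertino": 12, "Sunnyvale": 13, "Santa Clara": 13,
}


def cluster_sample(cluster_sizes, n_clusters, seed=None):
    """Randomly choose whole clusters and keep every member of each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(cluster_sizes), n_clusters)
    return {city: list(range(1, cluster_sizes[city] + 1)) for city in chosen}


sample = cluster_sample(CITY_SIZES, 2, seed=42)
for city, members in sample.items():
    print(city, len(members))
```

The sample size depends on which cities are drawn, which is why the lab notes that the number of restaurants will vary.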


KEY TERMS

Average: also called mean; a number that describes the central tendency of the data

Blinding: not telling participants which treatment a subject is receiving

Categorical Variable: variables that take on values that are names or labels

Cluster Sampling: a method for selecting a random sample; divide the population into groups (clusters), then use simple random sampling to select a set of clusters. Every individual in the chosen clusters is included in the sample.

Continuous Random Variable: a random variable (RV) whose outcomes are measured; the height of trees in the forest is a continuous RV.

Control Group: a group in a randomized experiment that receives an inactive treatment but is otherwise managed exactly as the other groups

Convenience Sampling: a nonrandom method of selecting a sample; this method selects individuals that are easily accessible and may result in biased data.

Cumulative Relative Frequency: The term applies to an ordered set of observations from smallest to largest. The cumulative relative frequency is the sum of the relative frequencies for all values that are less than or equal to the given value.

Data: a set of observations (a set of possible outcomes); most data can be put into two groups: qualitative (an attribute whose value is indicated by a label) or quantitative (an attribute whose value is indicated by a number). Quantitative data can be separated into two subgroups: discrete and continuous. Data is discrete if it is the result of counting (such as the number of students of a given ethnic group in a class or the number of books on a shelf). Data is continuous if it is the result of measuring (such as distance traveled or weight of luggage).

Discrete Random Variable: a random variable (RV) whose outcomes are counted

Double-blinding: the act of blinding both the subjects of an experiment and the researchers who work with the subjects

Experimental Unit: any individual or object to be measured

Explanatory Variable: the independent variable in an experiment; the value controlled by researchers

Frequency: the number of times a value of the data occurs

Informed Consent: Any human subject in a research study must be cognizant of any risks or costs associated with the study. The subject has the right to know the nature of the treatments included in the study, their potential risks, and their potential benefits. Consent must be given freely by an informed, fit participant.

Institutional Review Board: a committee tasked with oversight of research programs that involve human subjects

Lurking Variable: a variable that has an effect on a study even though it is neither an explanatory variable nor a response variable

Nonsampling Error: an issue that affects the reliability of sampling data other than natural variation; it includes a variety of human errors including poor study design, biased sampling methods, inaccurate information provided by study participants, data entry errors, and poor analysis.

Numerical Variable: variables that take on values that are indicated by numbers

Parameter: a number that is used to represent a population characteristic and that generally cannot be determined easily

Placebo: an inactive treatment that has no real effect on the response variable

Population: all individuals, objects, or measurements whose properties are being studied

Probability: a number between zero and one, inclusive, that gives the likelihood that a specific event will occur


Proportion: the number of successes divided by the total number in the sample

Qualitative Data: See Data.

Quantitative Data: See Data.

Random Assignment: the act of organizing experimental units into treatment groups using random methods

Random Sampling: a method of selecting a sample that gives every member of the population an equal chance of being selected.

Relative Frequency: the ratio of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes

Representative Sample: a subset of the population that has the same characteristics as the population

Response Variable: the dependent variable in an experiment; the value that is measured for change at the end of an experiment

Sample: a subset of the population studied

Sampling Bias: not all members of the population are equally likely to be selected

Sampling Error: the natural variation that results from selecting a sample to represent a larger population; this variation decreases as the sample size increases, so selecting larger samples reduces sampling error.

Sampling with Replacement: Once a member of the population is selected for inclusion in a sample, that member is returned to the population for the selection of the next individual.

Sampling without Replacement: A member of the population may be chosen for inclusion in a sample only once. If chosen, the member is not returned to the population before the next selection.

Simple Random Sampling: a straightforward method for selecting a random sample; give each member of the population a number. Use a random number generator to select a set of labels. These randomly selected labels identify the members of your sample.

Statistic: a numerical characteristic of the sample; a statistic estimates the corresponding population parameter.

Stratified Sampling: a method for selecting a random sample used to ensure that subgroups of the population are represented adequately; divide the population into groups (strata). Use simple random sampling to identify a proportionate number of individuals from each stratum.

Systematic Sampling: a method for selecting a random sample; list the members of the population. Use simple random sampling to select a starting point in the population. Let k = (number of individuals in the population)/(number of individuals needed in the sample). Choose every kth individual in the list starting with the one that was randomly selected. If necessary, return to the beginning of the population list to complete your sample.

Treatments: different values or components of the explanatory variable applied in an experiment

Variable: a characteristic of interest for each person or object in a population

CHAPTER REVIEW

1.1 Definitions of Statistics, Probability, and Key Terms The mathematical theory of statistics is easier to learn when you know the language. This module presents important terms that will be used throughout the text.

1.2 Data, Sampling, and Variation in Data and Sampling Data are individual items of information that come from a population or sample. Data may be classified as qualitative (categorical), quantitative continuous, or quantitative discrete.

Because it is not practical to measure the entire population in a study, researchers use samples to represent the population. A random sample is a representative group from the population chosen by using a method that gives each individual in the population an equal chance of being included in the sample. Random sampling methods include simple random sampling, stratified sampling, cluster sampling, and systematic sampling. Convenience sampling is a nonrandom method of choosing a sample that often produces biased data.

Samples that contain different individuals result in different data. This is true even when the samples are well-chosen and representative of the population. When properly selected, larger samples model the population more closely than smaller samples. There are many different potential problems that can affect the reliability of a sample. Statistical data needs to be critically analyzed, not simply accepted.

1.3 Frequency, Frequency Tables, and Levels of Measurement Some calculations generate numbers that are artificially precise. It is not necessary to report a value to eight decimal places when the measures that generated that value were only accurate to the nearest tenth. Round off your final answer to one more decimal place than was present in the original data. This means that if you have data measured to the nearest tenth of a unit, report the final statistic to the nearest hundredth.

In addition to rounding your answers, you can classify your data using the following four levels of measurement.

• Nominal scale level: data that cannot be ordered or used in calculations

• Ordinal scale level: data that can be ordered; the differences cannot be measured

• Interval scale level: data with a definite ordering but no starting point; the differences can be measured, but there is no such thing as a ratio.

• Ratio scale level: data with a starting point that can be ordered; the differences have meaning and ratios can be calculated.

When organizing data, it is important to know how many times a value appears. How many statistics students study five hours or more for an exam? What percent of families on our block own two pets? Frequency, relative frequency, and cumulative relative frequency are measures that answer questions like these.

1.4 Experimental Design and Ethics A poorly designed study will not produce reliable data. There are certain key components that must be included in every experiment. To eliminate lurking variables, subjects must be assigned randomly to different treatment groups. One of the groups must act as a control group, demonstrating what happens when the active treatment is not applied. Participants in the control group receive a placebo treatment that looks exactly like the active treatments but cannot influence the response variable. To preserve the integrity of the placebo, both researchers and subjects may be blinded. When a study is designed properly, the only difference between treatment groups is the one imposed by the researcher. Therefore, when groups respond differently to different treatments, the difference must be due to the influence of the explanatory variable.

“An ethics problem arises when you are considering an action that benefits you or some cause you support, hurts or reduces benefits to others, and violates some rule.”[4] Ethical violations in statistics are not always easy to spot. Professional associations and federal agencies post guidelines for proper conduct. It is important that you learn basic statistical procedures so that you can recognize proper data analysis.

PRACTICE

1.1 Definitions of Statistics, Probability, and Key Terms

Use the following information to answer the next five exercises. Studies are often done by pharmaceutical companies to determine the effectiveness of a treatment program. Suppose that a new AIDS antibody drug is currently under study. It is given to patients once the AIDS symptoms have revealed themselves. Of interest is the average (mean) length of time in months patients live once they start the treatment. Two researchers each follow a different set of 40 patients with AIDS from the start of treatment until their deaths. The following data (in months) are collected.

4. Andrew Gelman, “Open Data and Open Methods,” Ethics and Statistics, http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics1.pdf (accessed May 1, 2013).


Researcher A:

3; 4; 11; 15; 16; 17; 22; 44; 37; 16; 14; 24; 25; 15; 26; 27; 33; 29; 35; 44; 13; 21; 22; 10; 12; 8; 40; 32; 26; 27; 31; 34; 29; 17; 8; 24; 18; 47; 33; 34

Researcher B:

3; 14; 11; 5; 16; 17; 28; 41; 31; 18; 14; 14; 26; 25; 21; 22; 31; 2; 35; 44; 23; 21; 21; 16; 12; 18; 41; 22; 16; 25; 33; 34; 29; 13; 18; 24; 23; 42; 33; 29

Determine what the key terms refer to in the example for Researcher A.

1. population

2. sample

3. parameter

4. statistic

5. variable

1.2 Data, Sampling, and Variation in Data and Sampling

6. “Number of times per week” is what type of data?

a. qualitative (categorical); b. quantitative discrete; c. quantitative continuous

Use the following information to answer the next four exercises: A study was done to determine the age, number of times per week, and the duration (amount of time) of residents using a local park in San Antonio, Texas. The first house in the neighborhood around the park was selected randomly, and then the resident of every eighth house in the neighborhood around the park was interviewed.

7. The sampling method was

a. simple random; b. systematic; c. stratified; d. cluster

8. “Duration (amount of time)” is what type of data?

a. qualitative (categorical); b. quantitative discrete; c. quantitative continuous

9. The colors of the houses around the park are what kind of data?

a. qualitative (categorical); b. quantitative discrete; c. quantitative continuous

10. The population is ______________________


11. Table 1.26 contains the total number of deaths worldwide as a result of earthquakes from 2000 to 2012.

Year Total Number of Deaths

2000 231

2001 21,357

2002 11,685

2003 33,819

2004 228,802

2005 88,003

2006 6,605

2007 712

2008 88,011

2009 1,790

2010 320,120

2011 21,953

2012 768

Total 823,856

Table 1.26

Use Table 1.26 to answer the following questions.

a. What is the proportion of deaths between 2007 and 2012?
b. What percent of deaths occurred before 2001?
c. What is the percent of deaths that occurred in 2003 or after 2010?
d. What is the fraction of deaths that happened before 2012?
e. What kind of data is the number of deaths?
f. Earthquakes are quantified according to the amount of energy they produce (examples are 2.1, 5.0, 6.7). What type of data is that?
g. What contributed to the large number of deaths in 2010? In 2004? Explain.

For the following four exercises, determine the type of sampling used (simple random, stratified, systematic, cluster, or convenience).

12. A group of test subjects is divided into twelve groups; then four of the groups are chosen at random.

13. A market researcher polls every tenth person who walks into a store.

14. The first 50 people who walk into a sporting event are polled on their television preferences.

15. A computer generates 100 random numbers, and 100 people whose names correspond with the numbers on the list are chosen.

Use the following information to answer the next seven exercises: Studies are often done by pharmaceutical companies to determine the effectiveness of a treatment program. Suppose that a new AIDS antibody drug is currently under study. It is given to patients once the AIDS symptoms have revealed themselves. Of interest is the average (mean) length of time in months patients live once starting the treatment. Two researchers each follow a different set of 40 AIDS patients from the start of treatment until their deaths. The following data (in months) are collected.

Researcher A: 3; 4; 11; 15; 16; 17; 22; 44; 37; 16; 14; 24; 25; 15; 26; 27; 33; 29; 35; 44; 13; 21; 22; 10; 12; 8; 40; 32; 26; 27; 31; 34; 29; 17; 8; 24; 18; 47; 33; 34

Researcher B: 3; 14; 11; 5; 16; 17; 28; 41; 31; 18; 14; 14; 26; 25; 21; 22; 31; 2; 35; 44; 23; 21; 21; 16; 12; 18; 41; 22; 16; 25; 33; 34; 29; 13; 18; 24; 23; 42; 33; 29


16. Complete the tables using the data provided:

Survival Length (in months) Frequency Relative Frequency Cumulative Relative Frequency

0.5–6.5

6.5–12.5

12.5–18.5

18.5–24.5

24.5–30.5

30.5–36.5

36.5–42.5

42.5–48.5

Table 1.27 Researcher A

Survival Length (in months) Frequency Relative Frequency Cumulative Relative Frequency

0.5–6.5

6.5–12.5

12.5–18.5

18.5–24.5

24.5–30.5

30.5–36.5

36.5–45.5

Table 1.28 Researcher B
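A completed table of this kind can be checked with a short Python sketch that counts how many values fall between each pair of class boundaries. The function name `grouped_frequency_table` and the data below are hypothetical and are not the researchers' data. The sketch assumes that no value falls exactly on a boundary, which holds when whole-number data are grouped with half-unit boundaries.

```python
def grouped_frequency_table(values, boundaries):
    """Return rows of (lower, upper, frequency, relative freq., cumulative relative freq.)."""
    n = len(values)
    rows = []
    cumulative = 0
    # Pair each boundary with the next one to form the classes.
    for lower, upper in zip(boundaries, boundaries[1:]):
        freq = sum(1 for v in values if lower <= v < upper)
        cumulative += freq
        rows.append((lower, upper, freq, freq / n, cumulative / n))
    return rows


# Hypothetical survival lengths in months.
months = [1, 5, 7, 9, 12, 13, 20, 20]
boundaries = [0.5, 6.5, 12.5, 18.5, 24.5]
for lower, upper, freq, rel, cum in grouped_frequency_table(months, boundaries):
    print(f"{lower}–{upper}: {freq} {rel:.3f} {cum:.3f}")
```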

17. Determine what the key term data refers to in the above example for Researcher A.

18. List two reasons why the data may differ.

19. Can you tell if one researcher is correct and the other one is incorrect? Why?

20. Would you expect the data to be identical? Why or why not?

21. Suggest at least two methods the researchers might use to gather random data.

22. Suppose that the first researcher conducted his survey by randomly choosing one state in the nation and then randomly picking 40 patients from that state. What sampling method would that researcher have used?

23. Suppose that the second researcher conducted his survey by choosing 40 patients he knew. What sampling method would that researcher have used? What concerns would you have about this data set, based upon the data collection method?

Use the following data to answer the next five exercises: Two researchers are gathering data on hours of video games played by school-aged children and young adults. They each randomly sample different groups of 150 students from the same school. They collect the following data.

Hours Played per Week Frequency Relative Frequency Cumulative Relative Frequency

0–2 26 0.17 0.17


2–4 30 0.20 0.37

4–6 49 0.33 0.70

6–8 25 0.17 0.87

8–10 12 0.08 0.95

10–12 8 0.05 1

Table 1.29 Researcher A

Hours Played per Week Frequency Relative Frequency Cumulative Relative Frequency

0–2 48 0.32 0.32

2–4 51 0.34 0.66

4–6 24 0.16 0.82

6–8 12 0.08 0.90

8–10 11 0.07 0.97

10–12 4 0.03 1

Table 1.30 Researcher B

24. Give a reason why the data may differ.

25. Would the sample size be large enough if the population is the students in the school?

26. Would the sample size be large enough if the population is school-aged children and young adults in the United States?

27. Researcher A concludes that most students play video games between four and six hours each week. Researcher B concludes that most students play video games between two and four hours each week. Who is correct?

28. As part of a way to reward students for participating in the survey, the researchers gave each student a gift card to a video game store. Would this affect the data if students knew about the award before the study?

Use the following data to answer the next five exercises: A pair of studies was performed to measure the effectiveness of a new software program designed to help stroke patients regain their problem-solving skills. Patients were asked to use the software program twice a day, once in the morning and once in the evening. The studies observed 200 stroke patients recovering over a period of several weeks. The first study collected the data in Table 1.31. The second study collected the data in Table 1.32.

Group Showed improvement No improvement Deterioration

Used program 142 43 15

Did not use program 72 110 18

Table 1.31

Group Showed improvement No improvement Deterioration

Used program 105 74 19

Did not use program 89 99 12

Table 1.32

29. Given what you know, which study is correct?


30. The first study was performed by the company that designed the software program. The second study was performed by the American Medical Association. Which study is more reliable?

31. Both groups that performed the study concluded that the software works. Is this accurate?

32. The company takes the two studies as proof that their software causes mental improvement in stroke patients. Is this a fair statement?

33. Patients who used the software were also a part of an exercise program whereas patients who did not use the software were not. Does this change the validity of the conclusions from Exercise 1.31?

34. Is a sample size of 1,000 a reliable measure for a population of 5,000?

35. Is a sample of 500 volunteers a reliable measure for a population of 2,500?

36. A question on a survey reads: “Do you prefer the delicious taste of Brand X or the taste of Brand Y?” Is this a fair question?

37. Is a sample size of two representative of a population of five?

38. Is it possible for two experiments to be well run with similar sample sizes to get different data?

1.3 Frequency, Frequency Tables, and Levels of Measurement

39. What type of measure scale is being used? Nominal, ordinal, interval or ratio.
a. High school soccer players classified by their athletic ability: Superior, Average, Above average
b. Baking temperatures for various main dishes: 350, 400, 325, 250, 300
c. The colors of crayons in a 24-crayon box
d. Social security numbers
e. Incomes measured in dollars
f. A satisfaction survey of a social website by number: 1 = very satisfied, 2 = somewhat satisfied, 3 = not satisfied
g. Political outlook: extreme left, left-of-center, right-of-center, extreme right
h. Time of day on an analog watch
i. The distance in miles to the closest grocery store
j. The dates 1066, 1492, 1644, 1947, and 1944
k. The heights of 21–65 year-old women
l. Common letter grades: A, B, C, D, and F

1.4 Experimental Design and Ethics

40. Design an experiment. Identify the explanatory and response variables. Describe the population being studied and the experimental units. Explain the treatments that will be used and how they will be assigned to the experimental units. Describe how blinding and placebos may be used to counter the power of suggestion.

41. Discuss potential violations of the rule requiring informed consent.
a. Inmates in a correctional facility are offered good behavior credit in return for participation in a study.
b. A research study is designed to investigate a new children’s allergy medication.
c. Participants in a study are told that the new medication being tested is highly promising, but they are not told that only a small portion of participants will receive the new medication. Others will receive placebo treatments and traditional treatments.

HOMEWORK

1.1 Definitions of Statistics, Probability, and Key Terms

For each of the following eight exercises, identify: a. the population, b. the sample, c. the parameter, d. the statistic, e. the variable, and f. the data. Give examples where appropriate.

42. A fitness center is interested in the mean amount of time a client exercises in the center each week.

43. Ski resorts are interested in the mean age that children take their first ski and snowboard lessons. They need this information to plan their ski classes optimally.


44. A cardiologist is interested in the mean recovery period of her patients who have had heart attacks.

45. Insurance companies are interested in the mean health costs each year of their clients, so that they can determine the costs of health insurance.

46. A politician is interested in the proportion of voters in his district who think he is doing a good job.

47. A marriage counselor is interested in the proportion of clients she counsels who stay married.

48. Political pollsters may be interested in the proportion of people who will vote for a particular cause.

49. A marketing company is interested in the proportion of people who will buy a particular product.

Use the following information to answer the next three exercises: A Lake Tahoe Community College instructor is interested in the mean number of days Lake Tahoe Community College math students are absent from class during a quarter.

50. What is the population she is interested in? a. all Lake Tahoe Community College students b. all Lake Tahoe Community College English students c. all Lake Tahoe Community College students in her classes d. all Lake Tahoe Community College math students

51. Consider the following:

X = number of days a Lake Tahoe Community College math student is absent

In this case, X is an example of a:

a. variable. b. population. c. statistic. d. data.

52. The instructor’s sample produces a mean number of days absent of 3.5 days. This value is an example of a: a. parameter. b. data. c. statistic. d. variable.

1.2 Data, Sampling, and Variation in Data and Sampling

For the following exercises, identify the type of data that would be used to describe a response (quantitative discrete, quantitative continuous, or qualitative), and give an example of the data.

53. number of tickets sold to a concert

54. percent of body fat

55. favorite baseball team

56. time in line to buy groceries

57. number of students enrolled at Evergreen Valley College

58. most-watched television show

59. brand of toothpaste

60. distance to the closest movie theatre

61. age of executives in Fortune 500 companies

62. number of competing computer spreadsheet software packages

Use the following information to answer the next two exercises: A study was done to determine the age, number of times per week, and the duration (amount of time) of resident use of a local park in San Jose. The first house in the neighborhood around the park was selected randomly and then every 8th house in the neighborhood around the park was interviewed.


63. “Number of times per week” is what type of data? a. qualitative b. quantitative discrete c. quantitative continuous

64. “Duration (amount of time)” is what type of data? a. qualitative b. quantitative discrete c. quantitative continuous

65. Airline companies are interested in the consistency of the number of babies on each flight, so that they have adequate safety equipment. Suppose an airline conducts a survey. Over Thanksgiving weekend, it surveys six flights from Boston to Salt Lake City to determine the number of babies on the flights. It determines the amount of safety equipment needed by the result of that study.

a. Using complete sentences, list three things wrong with the way the survey was conducted. b. Using complete sentences, list three ways that you would improve the survey if it were to be repeated.

66. Suppose you want to determine the mean number of students per statistics class in your state. Describe a possible sampling method in three to five complete sentences. Make the description detailed.

67. Suppose you want to determine the mean number of cans of soda drunk each month by students in their twenties at your school. Describe a possible sampling method in three to five complete sentences. Make the description detailed.

68. List some practical difficulties involved in getting accurate results from a telephone survey.

69. List some practical difficulties involved in getting accurate results from a mailed survey.

70. With your classmates, brainstorm some ways you could overcome these problems if you needed to conduct a phone or mail survey.

71. The instructor takes her sample by gathering data on five randomly selected students from each Lake Tahoe Community College math class. The type of sampling she used is

a. cluster sampling b. stratified sampling c. simple random sampling d. convenience sampling

72. A study was done to determine the age, number of times per week, and the duration (amount of time) of residents using a local park in San Jose. The first house in the neighborhood around the park was selected randomly and then every eighth house in the neighborhood around the park was interviewed. The sampling method was:

a. simple random b. systematic c. stratified d. cluster

73. Name the sampling method used in each of the following situations:

a. A woman in the airport is handing out questionnaires to travelers asking them to evaluate the airport’s service. She does not ask travelers who are hurrying through the airport with their hands full of luggage, but instead asks all travelers who are sitting near gates and not taking naps while they wait.

b. A teacher wants to know if her students are doing homework, so she randomly selects rows two and five and then calls on all students in row two and all students in row five to present the solutions to homework problems to the class.

c. The marketing manager for an electronics chain store wants information about the ages of its customers. Over the next two weeks, at each store location, 100 randomly selected customers are given questionnaires to fill out asking for information about age, as well as about other variables of interest.

d. The librarian at a public library wants to determine what proportion of the library users are children. The librarian has a tally sheet on which she marks whether books are checked out by an adult or a child. She records this data for every fourth patron who checks out books.

e. A political party wants to know the reaction of voters to a debate between the candidates. The day after the debate, the party’s polling staff calls 1,200 randomly selected phone numbers. If a registered voter answers the phone or is available to come to the phone, that registered voter is asked whom he or she intends to vote for and whether the debate changed his or her opinion of the candidates.


74. A “random survey” was conducted of 3,274 people of the “microprocessor generation” (people born since 1971, the year the microprocessor was invented). It was reported that 48% of those individuals surveyed stated that if they had $2,000 to spend, they would use it for computer equipment. Also, 66% of those surveyed considered themselves relatively savvy computer users.

a. Do you consider the sample size large enough for a study of this type? Why or why not?

b. Based on your “gut feeling,” do you believe the percents accurately reflect the U.S. population for those individuals born since 1971? If not, do you think the percents of the population are actually higher or lower than the sample statistics? Why?

Additional information: The survey, reported by Intel Corporation, was filled out by individuals who visited the Los Angeles Convention Center to see the Smithsonian Institute’s road show called “America’s Smithsonian.”

c. With this additional information, do you feel that all demographic and ethnic groups were equally represented at the event? Why or why not?

d. With the additional information, comment on how accurately you think the sample statistics reflect the population parameters.

75. The Well-Being Index is a survey that follows trends of U.S. residents on a regular basis. There are six areas of health and wellness covered in the survey: Life Evaluation, Emotional Health, Physical Health, Healthy Behavior, Work Environment, and Basic Access. Some of the questions used to measure the Index are listed below.

Identify the type of data obtained from each question used in this survey: qualitative, quantitative discrete, or quantitative continuous.

a. Do you have any health problems that prevent you from doing any of the things people your age can normally do? b. During the past 30 days, for about how many days did poor health keep you from doing your usual activities? c. In the last seven days, on how many days did you exercise for 30 minutes or more? d. Do you have health insurance coverage?

76. In advance of the 1936 Presidential Election, a magazine titled Literary Digest released the results of an opinion poll predicting that the Republican candidate Alf Landon would win by a large margin. The magazine sent postcards to approximately 10,000,000 prospective voters. These prospective voters were selected from the subscription list of the magazine, from automobile registration lists, from phone lists, and from club membership lists. Approximately 2,300,000 people returned the postcards.

a. Think about the state of the United States in 1936. Explain why a sample chosen from magazine subscription lists, automobile registration lists, phone books, and club membership lists was not representative of the population of the United States at that time.

b. What effect does the low response rate have on the reliability of the sample?

c. Are these problems examples of sampling error or nonsampling error?

d. During the same year, George Gallup conducted his own poll of 30,000 prospective voters. These researchers used a method they called “quota sampling” to obtain survey answers from specific subsets of the population. Quota sampling is an example of which sampling method described in this module?

77. Crime-related and demographic statistics for 47 US states in 1960 were collected from government agencies, including the FBI’s Uniform Crime Report. One analysis of this data found a strong connection between education and crime indicating that higher levels of education in a community correspond to higher crime rates.

Which of the potential problems with samples discussed in Section 1.2 could explain this connection?

78. YouPolls is a website that allows anyone to create and respond to polls. One question posted April 15 asks:

“Do you feel happy paying your taxes when members of the Obama administration are allowed to ignore their tax liabilities?”[5]

As of April 25, 11 people responded to this question. Each participant answered “NO!”

Which of the potential problems with samples discussed in this module could explain this result?

5. lastbaldeagle. 2013. On Tax Day, House to Call for Firing Federal Workers Who Owe Back Taxes. Opinion poll posted online at: http://www.youpolls.com/details.aspx?id=12328 (accessed May 1, 2013).


79. A scholarly article about response rates begins with the following quote:

“Declining contact and cooperation rates in random digit dial (RDD) national telephone surveys raise serious concerns about the validity of estimates drawn from such research.”[6]

The Pew Research Center for People and the Press admits:

“The percentage of people we interview – out of all we try to interview – has been declining over the past decade or more.”[7]

a. What are some reasons for the decline in response rate over the past decade? b. Explain why researchers are concerned with the impact of the declining response rate on public opinion polls.

1.3 Frequency, Frequency Tables, and Levels of Measurement

80. Fifty part-time students were asked how many courses they were taking this term. The (incomplete) results are shown below:

# of Courses Frequency Relative Frequency Cumulative Relative Frequency

1 30 0.6

2 15

3

Table 1.33 Part-time Student Course Loads

a. Fill in the blanks in Table 1.33. b. What percent of students take exactly two courses? c. What percent of students take one or two courses?

81. Sixty adults with gum disease were asked the number of times per week they used to floss before their diagnosis. The (incomplete) results are shown in Table 1.34.

# Flossing per Week Frequency Relative Frequency Cumulative Relative Freq.

0 27 0.4500

1 18

3 0.9333

6 3 0.0500

7 1 0.0167

Table 1.34 Flossing Frequency for Adults with Gum Disease

a. Fill in the blanks in Table 1.34. b. What percent of adults flossed six times per week? c. What percent flossed at most three times per week?

6. Scott Keeter et al., “Gauging the Impact of Growing Nonresponse on Estimates from a National RDD Telephone Survey,” Public Opinion Quarterly 70 no. 5 (2006), http://poq.oxfordjournals.org/content/70/5/759.full (accessed May 1, 2013).

7. Frequently Asked Questions, Pew Research Center for the People & the Press, http://www.people-press.org/methodology/frequently-asked-questions/#dont-you-have-trouble-getting-people-to-answer-your-polls (accessed May 1, 2013).


82. Nineteen immigrants to the U.S. were asked how many years, to the nearest year, they have lived in the U.S. The data are as follows: 2; 5; 7; 2; 2; 10; 20; 15; 0; 7; 0; 20; 5; 12; 15; 12; 4; 5; 10.

Table 1.35 was produced.

Data Frequency Relative Frequency Cumulative Relative Frequency

0 2 2/19 0.1053

2 3 3/19 0.2632

4 1 1/19 0.3158

5 3 3/19 0.4737

7 2 2/19 0.5789

10 2 2/19 0.6842

12 2 2/19 0.7895

15 1 1/19 0.8421

20 1 1/19 1.0000

Table 1.35 Frequency of Immigrant Survey Responses

a. Fix the errors in Table 1.35. Also, explain how someone might have arrived at the incorrect number(s). b. Explain what is wrong with this statement: “47 percent of the people surveyed have lived in the U.S. for 5 years.” c. Fix the statement in b to make it correct. d. What fraction of the people surveyed have lived in the U.S. five or seven years? e. What fraction of the people surveyed have lived in the U.S. at most 12 years? f. What fraction of the people surveyed have lived in the U.S. fewer than 12 years? g. What fraction of the people surveyed have lived in the U.S. from five to 20 years, inclusive?

83. How much time does it take to travel to work? Table 1.36 shows the mean commute time by state for workers at least 16 years old who are not working at home. Find the mean travel time, and round off the answer properly.

24.0 24.3 25.9 18.9 27.5 17.9 21.8 20.9 16.7 27.3

18.2 24.7 20.0 22.6 23.9 18.0 31.4 22.3 24.0 25.5

24.7 24.6 28.1 24.9 22.6 23.6 23.4 25.7 24.8 25.5

21.2 25.7 23.1 23.0 23.9 26.0 16.3 23.1 21.4 21.5

27.0 27.0 18.6 31.7 23.3 30.1 22.9 23.3 21.7 18.6

Table 1.36


84. Forbes magazine published data on the best small firms in 2012. These were firms that had been publicly traded for at least a year, had a stock price of at least $5 per share, and had reported annual revenue between $5 million and $1 billion. Table 1.37 shows the ages of the chief executive officers for the first 60 ranked firms.

Age Frequency Relative Frequency Cumulative Relative Frequency

40–44 3

45–49 11

50–54 13

55–59 16

60–64 10

65–69 6

70–74 1

Table 1.37

a. What is the frequency for CEO ages between 54 and 65? b. What percentage of CEOs are 65 years or older? c. What is the relative frequency of ages under 50? d. What is the cumulative relative frequency for CEOs younger than 55? e. Which graph shows the relative frequency and which shows the cumulative relative frequency?

Figure 1.13 Graphs (a) and (b)

Use the following information to answer the next two exercises: Table 1.38 contains data on hurricanes that have made direct hits on the U.S. between 1851 and 2004. A hurricane is given a strength category rating based on the minimum wind speed generated by the storm.

Category Number of Direct Hits Relative Frequency Cumulative Frequency

1 109 0.3993 0.3993

2 72 0.2637 0.6630

3 71 0.2601


4 18 0.9890

5 3 0.0110 1.0000

Total = 273

Table 1.38 Frequency of Hurricane Direct Hits

85. What is the relative frequency of direct hits that were category 4 hurricanes? a. 0.0768 b. 0.0659 c. 0.2601 d. Not enough information to calculate

86. What is the relative frequency of direct hits that were AT MOST a category 3 storm? a. 0.3480 b. 0.9231 c. 0.2601 d. 0.3370

1.4 Experimental Design and Ethics

87. How does sleep deprivation affect your ability to drive? A recent study measured the effects on 19 professional drivers. Each driver participated in two experimental sessions: one after normal sleep and one after 27 hours of total sleep deprivation. The treatments were assigned in random order. In each session, performance was measured on a variety of tasks including a driving simulation.

Use key terms from this module to describe the design of this experiment.

88. An advertisement for Acme Investments displays the two graphs in Figure 1.14 to show the value of Acme’s product in comparison with the Other Guy’s product. Describe the potentially misleading visual effect of these comparison graphs. How can this be corrected?

Figure 1.14 Graphs (a) and (b): As the graphs show, Acme consistently outperforms the Other Guys!


89. The graph in Figure 1.15 shows the number of complaints for six different airlines as reported to the US Department of Transportation in February 2013. Alaska, Pinnacle, and Airtran Airlines have far fewer complaints reported than American, Delta, and United. Can we conclude that American, Delta, and United are the worst airline carriers since they have the most complaints?

Figure 1.15

BRINGING IT TOGETHER: HOMEWORK

90. Seven hundred and seventy-one distance learning students at Long Beach City College responded to surveys in the 2010-11 academic year. Highlights of the summary report are listed in Table 1.39.

Have computer at home 96%

Unable to come to campus for classes 65%

Age 41 or over 24%

Would like LBCC to offer more DL courses 95%

Took DL classes due to a disability 17%

Live at least 16 miles from campus 13%

Took DL courses to fulfill transfer requirements 71%

Table 1.39 LBCC Distance Learning Survey Results

a. What percent of the students surveyed do not have a computer at home?
b. About how many students in the survey live at least 16 miles from campus?
c. If the same survey were done at Great Basin College in Elko, Nevada, do you think the percentages would be the same? Why?


91. Several online textbook retailers advertise that they have lower prices than on-campus bookstores. However, an important factor is whether the Internet retailers actually have the textbooks that students need in stock. Students need to be able to get textbooks promptly at the beginning of the college term. If the book is not available, then a student would not be able to get the textbook at all, or might get a delayed delivery if the book is back ordered.

A college newspaper reporter is investigating textbook availability at online retailers. He decides to investigate one textbook for each of the following seven subjects: calculus, biology, chemistry, physics, statistics, geology, and general engineering. He consults textbook industry sales data and selects the most popular nationally used textbook in each of these subjects. He visits websites for a random sample of major online textbook sellers and looks up each of these seven textbooks to see if they are available in stock for quick delivery through these retailers. Based on his investigation, he writes an article in which he draws conclusions about the overall availability of all college textbooks through online textbook retailers.

Write an analysis of his study that addresses the following issues: Is his sample representative of the population of all college textbooks? Explain why or why not. Describe some possible sources of bias in this study, and how it might affect the results of the study. Give some suggestions about what could be done to improve the study.

REFERENCES

1.1 Definitions of Statistics, Probability, and Key Terms

The Data and Story Library, http://lib.stat.cmu.edu/DASL/Stories/CrashTestDummies.html (accessed May 1, 2013).

1.2 Data, Sampling, and Variation in Data and Sampling

Gallup-Healthways Well-Being Index. http://www.well-beingindex.com/default.asp (accessed May 1, 2013).

Gallup-Healthways Well-Being Index. http://www.well-beingindex.com/methodology.asp (accessed May 1, 2013).

Gallup-Healthways Well-Being Index. http://www.gallup.com/poll/146822/gallup-healthways-index-questions.aspx (accessed May 1, 2013).

Data from http://www.bookofodds.com/Relationships-Society/Articles/A0374-How-George-Gallup-Picked-the-President

Dominic Lusinchi, “’President’ Landon and the 1936 Literary Digest Poll: Were Automobile and Telephone Owners to Blame?” Social Science History 36, no. 1: 23-54 (2012), http://ssh.dukejournals.org/content/36/1/23.abstract (accessed May 1, 2013).

“The Literary Digest Poll,” Virtual Laboratories in Probability and Statistics http://www.math.uah.edu/stat/data/LiteraryDigest.html (accessed May 1, 2013).

“Gallup Presidential Election Trial-Heat Trends, 1936–2008,” Gallup Politics http://www.gallup.com/poll/110548/gallup-presidential-election-trialheat-trends-19362004.aspx#4 (accessed May 1, 2013).

The Data and Story Library, http://lib.stat.cmu.edu/DASL/Datafiles/USCrime.html (accessed May 1, 2013).

LBCC Distance Learning (DL) program data in 2010-2011, http://de.lbcc.edu/reports/2010-11/future/highlights.html#focus (accessed May 1, 2013).

Data from San Jose Mercury News

1.3 Frequency, Frequency Tables, and Levels of Measurement

“State & County QuickFacts,” U.S. Census Bureau. http://quickfacts.census.gov/qfd/download_data.html (accessed May 1, 2013).

“State & County QuickFacts: Quick, easy access to facts about people, business, and geography,” U.S. Census Bureau. http://quickfacts.census.gov/qfd/index.html (accessed May 1, 2013).

“Table 5: Direct hits by mainland United States Hurricanes (1851-2004),” National Hurricane Center, http://www.nhc.noaa.gov/gifs/table5.gif (accessed May 1, 2013).

“Levels of Measurement,” http://infinity.cos.edu/faculty/woodbury/stats/tutorial/Data_Levels.htm (accessed May 1, 2013).

Courtney Taylor, “Levels of Measurement,” about.com, http://statistics.about.com/od/HelpandTutorials/a/Levels-Of-Measurement.htm (accessed May 1, 2013).

David Lane. “Levels of Measurement,” Connexions, http://cnx.org/content/m10809/latest/ (accessed May 1, 2013).

1.4 Experimental Design and Ethics

“Vitamin E and Health,” Nutrition Source, Harvard School of Public Health, http://www.hsph.harvard.edu/nutritionsource/vitamin-e/ (accessed May 1, 2013).

Stan Reents. “Don’t Underestimate the Power of Suggestion,” athleteinme.com, http://www.athleteinme.com/ArticleView.aspx?id=1053 (accessed May 1, 2013).

Ankita Mehta. “Daily Dose of Aspirin Helps Reduce Heart Attacks: Study,” International Business Times, July 21, 2011. Also available online at http://www.ibtimes.com/daily-dose-aspirin-helps-reduce-heart-attacks-study-300443 (accessed May 1, 2013).

The Data and Story Library, http://lib.stat.cmu.edu/DASL/Stories/ScentsandLearning.html (accessed May 1, 2013).

M.L. Jackson et al., “Cognitive Components of Simulated Driving Performance: Sleep Loss effect and Predictors,” Accident Analysis and Prevention Journal, Jan no. 50 (2013), http://www.ncbi.nlm.nih.gov/pubmed/22721550 (accessed May 1, 2013).

“Earthquake Information by Year,” U.S. Geological Survey. http://earthquake.usgs.gov/earthquakes/eqarchives/year/ (accessed May 1, 2013).

“Fatality Analysis Report Systems (FARS) Encyclopedia,” National Highway Traffic and Safety Administration. http://www-fars.nhtsa.dot.gov/Main/index.aspx (accessed May 1, 2013).

Data from www.businessweek.com (accessed May 1, 2013).

Data from www.forbes.com (accessed May 1, 2013).

“America’s Best Small Companies,” http://www.forbes.com/best-small-companies/list/ (accessed May 1, 2013).

U.S. Department of Health and Human Services, Code of Federal Regulations Title 45 Public Welfare Department of Health and Human Services Part 46 Protection of Human Subjects revised January 15, 2009. Section 46.111:Criteria for IRB Approval of Research.

“April 2013 Air Travel Consumer Report,” U.S. Department of Transportation, April 11 (2013), http://www.dot.gov/airconsumer/april-2013-air-travel-consumer-report (accessed May 1, 2013).

Lori Alden, “Statistics can be Misleading,” econoclass.com, http://www.econoclass.com/misleadingstats.html (accessed May 1, 2013).

Maria de los A. Medina, “Ethics in Statistics,” Based on “Building an Ethics Module for Business, Science, and Engineering Students” by Jose A. Cruz-Cruz and William Frey, Connexions, http://cnx.org/content/m15555/latest/ (accessed May 1, 2013).

SOLUTIONS

1 AIDS patients.

3 The average length of time (in months) AIDS patients live after treatment.

5 X = the length of time (in months) AIDS patients live after treatment

7 b

9 a

11 a. 0.5242

b. 0.03%

c. 6.86%

d. 823,088/823,856

e. quantitative discrete

f. quantitative continuous

g. In both years, underwater earthquakes produced massive tsunamis.

13 systematic

15 simple random

17 values for X, such as 3, 4, 11, and so on

19 No, we do not have enough information to make such a claim.

21 Take a simple random sample from each group. One way is by assigning a number to each patient and using a random number generator to randomly select patients.

23 This would be convenience sampling and is not random.

25 Yes, the sample size of 150 would be large enough to reflect a population of one school.

27 Even though the specific data support each researcher’s conclusions, the different results suggest that more data need to be collected before the researchers can reach a conclusion.

29 There is not enough information given to judge if either one is correct or incorrect.

31 The software program seems to work because the second study shows that more patients improve while using the software than not. Even though the difference is not as large as that in the first study, the results from the second study are likely more reliable and still show improvement.

33 Yes, because we cannot tell if the improvement was due to the software or the exercise; the data is confounded, and a reliable conclusion cannot be drawn. New studies should be performed.

35 No, even though the sample is large enough, the fact that the sample consists of volunteers makes it a self-selected sample, which is not reliable.

37 No, even though the sample is a large portion of the population, two responses are not enough to justify any conclusions. Because the population is so small, it would be better to include everyone in the population to get the most accurate data.

39 a. ordinal

b. interval

c. nominal

d. nominal

e. ratio

f. ordinal

g. nominal

h. interval

i. ratio

j. interval

k. ratio

l. ordinal

41 a. Inmates may not feel comfortable refusing participation, or may feel obligated to take advantage of the promised benefits. They may not feel truly free to refuse participation.

b. Parents can provide consent on behalf of their children, but children are not competent to provide consent for themselves.

c. All risks and benefits must be clearly outlined. Study participants must be informed of relevant aspects of the study in order to give appropriate consent.

43 a. all children who take ski or snowboard lessons

b. a group of these children

c. the population mean age of children who take their first snowboard lesson

d. the sample mean age of children who take their first snowboard lesson

e. X = the age of one child who takes his or her first ski or snowboard lesson

f. values for X, such as 3, 7, and so on

45 a. the clients of the insurance companies

b. a group of the clients

c. the mean health costs of the clients

d. the mean health costs of the sample

e. X = the health costs of one client

f. values for X, such as 34, 9, 82, and so on

47 a. all the clients of this counselor

b. a group of clients of this marriage counselor

c. the proportion of all her clients who stay married

d. the proportion of the sample of the counselor’s clients who stay married

e. X = the number of couples who stay married

f. yes, no

49 a. all people (maybe in a certain geographic area, such as the United States)

b. a group of the people

c. the proportion of all people who will buy the product

d. the proportion of the sample who will buy the product

e. X = the number of people who will buy it

f. buy, not buy

51 a

53 quantitative discrete, 150

55 qualitative, Oakland A’s

57 quantitative discrete, 11,234 students

59 qualitative, Crest

61 quantitative continuous, 47.3 years

63 b

65 a. The survey was conducted using six similar flights.

The survey would not be a true representation of the entire population of air travelers. Conducting the survey on a holiday weekend will not produce representative results.

b. Conduct the survey during different times of the year. Conduct the survey using flights to and from various locations. Conduct the survey on different days of the week.

67 Answers will vary. Sample Answer: You could use a systematic sampling method. Stop the tenth person as they leave one of the buildings on campus at 9:50 in the morning. Then stop the tenth person as they leave a different building on campus at 1:50 in the afternoon.

69 Answers will vary. Sample Answer: Many people will not respond to mail surveys. If they do respond to the surveys, you can’t be sure who is responding. In addition, mailing lists can be incomplete.

71 b

73 convenience; cluster; stratified; systematic; simple random

75 a. qualitative

b. quantitative discrete

c. quantitative discrete

d. qualitative

77 Causality: The fact that two variables are related does not guarantee that one variable is influencing the other. We cannot assume that crime rate impacts education level or that education level impacts crime rate. Confounding: There are many factors that define a community other than education level and crime rate. Communities with high crime rates and high education levels may have other lurking variables that distinguish them from communities with lower crime rates and lower education levels. Because we cannot isolate these variables of interest, we cannot draw valid conclusions about the connection between education and crime. Possible lurking variables include police expenditures, unemployment levels, region, average age, and size.

79 a. Possible reasons: increased use of caller id, decreased use of landlines, increased use of private numbers, voice mail, privacy managers, hectic nature of personal schedules, decreased willingness to be interviewed

b. When a large number of people refuse to participate, then the sample may not have the same characteristics of the population. Perhaps the majority of people willing to participate are doing so because they feel strongly about the subject of the survey.

81 a.

# Flossing per Week Frequency Relative Frequency Cumulative Relative Frequency

0 27 0.4500 0.4500

1 18 0.3000 0.7500

3 11 0.1833 0.9333

6 3 0.0500 0.9833

7 1 0.0167 1

Table 1.40

b. 5.00%

c. 93.33%

83 The sum of the travel times is 1,173.1. Divide the sum by 50 to calculate the mean value: 23.462. Because each state’s travel time was measured to the nearest tenth, round this calculation to the nearest hundredth: 23.46.

85 b

87 Explanatory variable: amount of sleep

Response variable: performance measured in assigned tasks

Treatments: normal sleep and 27 hours of total sleep deprivation

Experimental Units: 19 professional drivers

Lurking variables: none – all drivers participated in both treatments

Random assignment: treatments were assigned in random order; this eliminated the effect of any “learning” that may take place during the first experimental session

Control/Placebo: completing the experimental session under normal sleep conditions

Blinding: researchers evaluating subjects’ performance must not know which treatment is being applied at the time

89 You cannot assume that the numbers of complaints reflect the quality of the airlines. The airlines shown with the greatest number of complaints are the ones with the most passengers. You must consider the appropriateness of methods for presenting data; in this case displaying totals is misleading.

91 Answers will vary. Sample answer: The sample is not representative of the population of all college textbooks. Two reasons why it is not representative are that he only sampled seven subjects and he only investigated one textbook in each subject. There are several possible sources of bias in the study. The seven subjects that he investigated are all in mathematics and the sciences; there are many subjects in the humanities, social sciences, and other subject areas, (for example: literature, art, history, psychology, sociology, business) that he did not investigate at all. It may be that different subject areas exhibit different patterns of textbook availability, but his sample would not detect such results. He also looked only at the most popular textbook in each of the subjects he investigated. The availability of the most popular textbooks may differ from the availability of other textbooks in one of two ways:

• the most popular textbooks may be more readily available online, because more new copies are printed, and more students nationwide are selling back their used copies OR

• the most popular textbooks may be harder to find available online, because more student demand exhausts the supply more quickly.

In reality, many college students do not use the most popular textbook in their subject, and this study gives no useful information about the situation for those less popular textbooks. He could improve this study by:

• expanding the selection of subjects he investigates so that it is more representative of all subjects studied by college students, and

• expanding the selection of textbooks he investigates within each subject to include a mixed representation of both the most popular and less popular textbooks.

2 | DESCRIPTIVE STATISTICS

Figure 2.1 When you have large amounts of data, you will need to organize it in a way that makes sense. These ballots from an election are rolled together with similar ballots to keep them organized. (credit: William Greeson)

Introduction

Chapter Objectives

By the end of this chapter, the student should be able to:

• Display data graphically and interpret graphs: stemplots, histograms, and box plots.

• Recognize, describe, and calculate the measures of location of data: quartiles and percentiles.

• Recognize, describe, and calculate the measures of the center of data: mean, median, and mode.

• Recognize, describe, and calculate the measures of the spread of data: variance, standard deviation, and range.

Once you have collected data, what will you do with it? Data can be described and presented in many different formats. For example, suppose you are interested in buying a house in a particular area. You may have no clue about the house prices, so you might ask your real estate agent to give you a sample data set of prices. Looking at all the prices in the sample often is overwhelming. A better way might be to look at the median price and the variation of prices. The median and variation are just two ways that you will learn to describe data. Your agent might also provide you with a graph of the data.

In this chapter, you will study numerical and graphical ways to describe and display your data. This area of statistics is called “Descriptive Statistics.” You will learn how to calculate, and even more importantly, how to interpret these measurements and graphs.

A statistical graph is a tool that helps you learn about the shape or distribution of a sample or a population. A graph can be a more effective way of presenting data than a mass of numbers because we can see where data clusters and where there are only a few data values. Newspapers and the Internet use graphs to show trends and to enable readers to compare facts and figures quickly. Statisticians often graph data first to get a picture of the data. Then, more formal tools may be applied.

Some of the types of graphs that are used to summarize and organize data are the dot plot, the bar graph, the histogram, the stem-and-leaf plot, the frequency polygon (a type of broken line graph), the pie chart, and the box plot. In this chapter, we will briefly look at stem-and-leaf plots, line graphs, and bar graphs, as well as frequency polygons, and time series graphs. Our emphasis will be on histograms and box plots.

NOTE

This book contains instructions for constructing a histogram and a box plot for the TI-83+ and TI-84 calculators. The Texas Instruments (TI) website (http://education.ti.com/educationportal/sites/US/sectionHome/support.html) provides additional instructions for using these calculators.

2.1 | Stem-and-Leaf Graphs (Stemplots), Line Graphs, and Bar Graphs

One simple graph, the stem-and-leaf graph or stemplot, comes from the field of exploratory data analysis. It is a good choice when the data sets are small. To create the plot, divide each observation of data into a stem and a leaf. The leaf consists of a final significant digit. For example, 23 has stem two and leaf three. The number 432 has stem 43 and leaf two. Likewise, the number 5,432 has stem 543 and leaf two. The decimal 9.3 has stem nine and leaf three. Write the stems in a vertical line from smallest to largest. Draw a vertical line to the right of the stems. Then write the leaves in increasing order next to their corresponding stem.

Example 2.1

For Susan Dean’s spring pre-calculus class, scores for the first exam were as follows (smallest to largest): 33; 42; 49; 49; 53; 55; 55; 61; 63; 67; 68; 68; 69; 69; 72; 73; 74; 78; 80; 83; 88; 88; 88; 90; 92; 94; 94; 94; 94; 96; 100

Stem Leaf

3 3

4 2 9 9

5 3 5 5

6 1 3 7 8 8 9 9

7 2 3 4 8

8 0 3 8 8 8

9 0 2 4 4 4 4 6

10 0

Table 2.1 Stem-and-Leaf Graph

The stemplot shows that most scores fell in the 60s, 70s, 80s, and 90s. Eight out of the 31 scores, or approximately 26% (8/31), were in the 90s or 100, a fairly high number of As.
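The construction used in Example 2.1 can also be sketched in code. The Python snippet below is only an illustration and not part of the textbook's method: the function name stemplot and the row format are assumptions made here, and the sketch handles only non-negative whole numbers, taking the final digit as the leaf.

```python
from collections import defaultdict


def stemplot(values):
    """Build stem-and-leaf rows for non-negative integers (leaf = final digit)."""
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 432 -> stem 43, leaf 2
        leaves[stem].append(leaf)
    rows = []
    # Walk every stem from smallest to largest, including stems with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row_leaves = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {row_leaves}".rstrip())
    return rows


# Exam scores from Example 2.1.
scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

rows = stemplot(scores)
for row in rows:
    print(row)

# Share of scores in the 90s or at 100.
high = sum(1 for s in scores if s >= 90)
print(high, len(scores), round(100 * high / len(scores)))  # 8 31 26
```

Each printed row corresponds to a row of Table 2.1. Because the loop visits every stem between the smallest and the largest, a stem with no leaves would still be printed, which makes gaps and possible outliers easier to see.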

2.1 For the Park City basketball team, scores for the last 30 games were as follows (smallest to largest): 32; 32; 33; 34; 38; 40; 42; 42; 43; 44; 46; 47; 47; 48; 48; 48; 49; 50; 50; 51; 52; 52; 52; 53; 54; 56; 57; 57; 60; 61 Construct a stem plot for the data.

The stemplot is a quick way to graph data and gives an exact picture of the data. You want to look for an overall pattern and any outliers. An outlier is an observation of data that does not fit the rest of the data. It is sometimes called an extreme value. When you graph an outlier, it will appear not to fit the pattern of the graph. Some outliers are due to mistakes (for example, writing down 50 instead of 500) while others may indicate that something unusual is happening. It takes some background information to explain outliers, so we will cover them in more detail later.

Example 2.2

The data are the distances (in kilometers) from a home to local supermarkets. Create a stemplot using the data: 1.1; 1.5; 2.3; 2.5; 2.7; 3.2; 3.3; 3.3; 3.5; 3.8; 4.0; 4.2; 4.5; 4.5; 4.7; 4.8; 5.5; 5.6; 6.5; 6.7; 12.3

Do the data seem to have any concentration of values?

NOTE

The leaves are to the right of the decimal.

Solution 2.2

The value 12.3 may be an outlier. Values appear to concentrate at three and four kilometers.

Stem Leaf

1 1 5

2 3 5 7

3 2 3 3 5 8

4 0 2 5 5 7 8

5 5 6

6 5 7

7

8

9

10

11

12 3

Table 2.2

2.2 The following data show the distances (in miles) from the homes of off-campus statistics students to the college. Create a stem plot using the data and identify any outliers:

0.5; 0.7; 1.1; 1.2; 1.2; 1.3; 1.3; 1.5; 1.5; 1.7; 1.7; 1.8; 1.9; 2.0; 2.2; 2.5; 2.6; 2.8; 2.8; 2.8; 3.5; 3.8; 4.4; 4.8; 4.9; 5.2; 5.5; 5.7; 5.8; 8.0

Example 2.3

A side-by-side stem-and-leaf plot allows a comparison of the two data sets in two columns. In a side-by-side stem-and-leaf plot, two sets of leaves share the same stem. The leaves are to the left and the right of the stems. Table 2.4 and Table 2.5 show the ages of presidents at their inauguration and at their death. Construct a side-by-side stem-and-leaf plot using this data.

Solution 2.3

Ages at Inauguration | | Ages at Death

9 9 8 7 7 7 6 3 2 | 4 | 6 9

8 7 7 7 7 6 6 6 5 5 5 5 4 4 4 4 4 2 2 1 1 1 1 1 0 | 5 | 3 6 6 7 7 8

9 8 5 4 4 2 1 1 1 0 | 6 | 0 0 3 3 4 4 5 6 7 7 7 8

| 7 | 0 0 1 1 1 4 7 8 8 9

| 8 | 0 1 3 5 8

| 9 | 0 0 3 3

Table 2.3

President Age President Age President Age

Washington 57 Lincoln 52 Hoover 54

J. Adams 61 A. Johnson 56 F. Roosevelt 51

Jefferson 57 Grant 46 Truman 60

Madison 57 Hayes 54 Eisenhower 62

Monroe 58 Garfield 49 Kennedy 43

J. Q. Adams 57 Arthur 51 L. Johnson 55

Jackson 61 Cleveland 47 Nixon 56

Van Buren 54 B. Harrison 55 Ford 61

W. H. Harrison 68 Cleveland 55 Carter 52

Tyler 51 McKinley 54 Reagan 69

Polk 49 T. Roosevelt 42 G.H.W. Bush 64

Taylor 64 Taft 51 Clinton 47

Fillmore 50 Wilson 56 G. W. Bush 54

Pierce 48 Harding 55 Obama 47

Buchanan 65 Coolidge 51

Table 2.4 Presidential Ages at Inauguration

President Age President Age President Age

Washington 67 Lincoln 56 Hoover 90

J. Adams 90 A. Johnson 66 F. Roosevelt 63

Jefferson 83 Grant 63 Truman 88

Madison 85 Hayes 70 Eisenhower 78

Monroe 73 Garfield 49 Kennedy 46

J. Q. Adams 80 Arthur 56 L. Johnson 64

Jackson 78 Cleveland 71 Nixon 81

Van Buren 79 B. Harrison 67 Ford 93

W. H. Harrison 68 Cleveland 71 Reagan 93

Tyler 71 McKinley 58

Polk 53 T. Roosevelt 60

Taylor 65 Taft 72

Fillmore 74 Wilson 67

Pierce 64 Harding 57

Buchanan 77 Coolidge 60

Table 2.5 Presidential Age at Death

2.3 The table shows the number of wins and losses the Atlanta Hawks have had in 42 seasons. Create a side-by-side stem-and-leaf plot of these wins and losses.

Losses Wins Year Losses Wins Year

34 48 1968–1969 41 41 1989–1990

34 48 1969–1970 39 43 1990–1991

46 36 1970–1971 44 38 1991–1992

46 36 1971–1972 39 43 1992–1993

36 46 1972–1973 25 57 1993–1994

47 35 1973–1974 40 42 1994–1995

51 31 1974–1975 36 46 1995–1996

53 29 1975–1976 26 56 1996–1997

51 31 1976–1977 32 50 1997–1998

41 41 1977–1978 19 31 1998–1999

36 46 1978–1979 54 28 1999–2000

32 50 1979–1980 57 25 2000–2001

51 31 1980–1981 49 33 2001–2002

40 42 1981–1982 47 35 2002–2003

39 43 1982–1983 54 28 2003–2004

42 40 1983–1984 69 13 2004–2005

48 34 1984–1985 56 26 2005–2006

32 50 1985–1986 52 30 2006–2007

25 57 1986–1987 45 37 2007–2008

32 50 1987–1988 35 47 2008–2009

30 52 1988–1989 29 53 2009–2010

Table 2.6

Another type of graph that is useful for specific data values is a line graph. In the particular line graph shown in Example 2.4, the x-axis (horizontal axis) consists of data values and the y-axis (vertical axis) consists of frequency points. The frequency points are connected using line segments.

Example 2.4

In a survey, 40 mothers were asked how many times per week a teenager must be reminded to do his or her chores. The results are shown in Table 2.7 and in Figure 2.2.

Number of times teenager is reminded Frequency

0 2

1 5

2 8

3 14

4 7

5 4

Table 2.7

Figure 2.2

2.4 In a survey, 40 people were asked how many times per year they had their car in the shop for repairs. The results are shown in Table 2.8. Construct a line graph.

Number of times in shop Frequency

0 7

1 10

2 14

3 9

Table 2.8

Bar graphs consist of bars that are separated from each other. The bars can be rectangles or they can be rectangular boxes (used in three-dimensional plots), and they can be vertical or horizontal. The bar graph shown in Example 2.5 has age groups represented on the x-axis and proportions on the y-axis.

Example 2.5

By the end of 2011, Facebook had over 146 million users in the United States. Table 2.9 shows three age groups, the number of users in each age group, and the proportion (%) of users in each age group. Construct a bar graph using this data.

Age groups Number of Facebook users Proportion (%) of Facebook users

13–25 65,082,280 45%

26–44 53,300,200 36%

45–64 27,885,100 19%

Table 2.9

Solution 2.5

Figure 2.3

2.5 The population in Park City is made up of children, working-age adults, and retirees. Table 2.10 shows the three age groups, the number of people in the town from each age group, and the proportion (%) of people in each age group. Construct a bar graph showing the proportions.

Age groups Number of people Proportion of population

Children 67,059 19%

Working-age adults 152,198 43%

Retirees 131,662 38%

Table 2.10

Example 2.6

The columns in Table 2.11 contain: the race or ethnicity of students in U.S. Public Schools for the class of 2011, percentages for the Advanced Placement examinee population for that class, and percentages for the overall student population. Create a bar graph with the student race or ethnicity (qualitative data) on the x-axis, and the Advanced Placement examinee population percentages on the y-axis.

Race/Ethnicity AP Examinee Population Overall Student Population

1 = Asian, Asian American or Pacific Islander 10.3% 5.7%

2 = Black or African American 9.0% 14.7%

3 = Hispanic or Latino 17.0% 17.6%

4 = American Indian or Alaska Native 0.6% 1.1%

5 = White 57.1% 59.2%

6 = Not reported/other 6.0% 1.7%

Table 2.11

Solution 2.6

Figure 2.4

2.6 Park City is broken down into six voting districts. The table shows the percent of the total registered voter population that lives in each district as well as the percent of the entire city population that lives in each district. Construct a bar graph that shows the registered voter population by district.

District Registered voter population Overall city population

1 15.5% 19.4%

2 12.2% 15.6%

3 9.8% 9.0%

4 17.4% 18.5%

5 22.8% 20.7%

6 22.3% 16.8%

Table 2.12

2.2 | Histograms, Frequency Polygons, and Time Series Graphs

For most of the work you do in this book, you will use a histogram to display the data. One advantage of a histogram is that it can readily display large data sets. A rule of thumb is to use a histogram when the data set consists of 100 values or more.

A histogram consists of contiguous (adjoining) boxes. It has both a horizontal axis and a vertical axis. The horizontal axis is labeled with what the data represents (for instance, distance from your home to school). The vertical axis is labeled either frequency or relative frequency (or percent frequency or probability). The graph will have the same shape with either label. The histogram (like the stemplot) can give you the shape of the data, the center, and the spread of the data.

The relative frequency is equal to the frequency for an observed value of the data divided by the total number of data values in the sample. (Remember, frequency is defined as the number of times an answer occurs.) If:

• f = frequency

• n = total number of data values (or the sum of the individual frequencies), and

• RF = relative frequency,

then:

RF = f/n

For example, if three students in Mr. Ahab’s English class of 40 students received from 90% to 100%, then f = 3, n = 40, and RF = f/n = 3/40 = 0.075. Thus, 7.5% of the students received 90–100%. The scores 90–100% are quantitative measures.
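As a quick check of the arithmetic, the relative frequency formula can be written as a short Python sketch. The function name relative_frequency is an assumption introduced here for illustration. The second part reuses the frequencies from Table 2.7 to show that the relative frequencies of a complete table add up to 1.

```python
from fractions import Fraction


def relative_frequency(f, n):
    """RF = f / n, kept as an exact fraction."""
    return Fraction(f, n)


# Mr. Ahab's class: 3 of 40 students scored 90-100%.
rf = relative_frequency(3, 40)
print(rf, float(rf))  # 3/40 0.075

# Frequencies from Table 2.7 (reminders per week); n is their sum.
frequencies = [2, 5, 8, 14, 7, 4]
n = sum(frequencies)
rfs = [relative_frequency(f, n) for f in frequencies]
print([float(x) for x in rfs])
print(sum(rfs))  # 1
```

Using exact fractions rather than floating-point division is a design choice of this sketch; it keeps the total at exactly 1 with no rounding error.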

To construct a histogram, first decide how many bars or intervals, also called classes, represent the data. Many histograms consist of five to 15 bars or classes for clarity. The number of bars needs to be chosen. Choose a starting point for the first interval to be less than the smallest data value. A convenient starting point is a lower value carried out to one more decimal place than the value with the most decimal places. For example, if the value with the most decimal places is 6.1 and this is the smallest value, a convenient starting point is 6.05 (6.1 – 0.05 = 6.05). We say that 6.05 has more precision. If the value with the most decimal places is 2.23 and the lowest value is 1.5, a convenient starting point is 1.495 (1.5 – 0.005 = 1.495). If the value with the most decimal places is 3.234 and the lowest value is 1.0, a convenient starting point is 0.9995 (1.0 – 0.0005 = 0.9995). If all the data happen to be integers and the smallest value is two, then a convenient starting point is 1.5 (2 – 0.5 = 1.5). Also, when the starting point and other boundaries are carried to one additional decimal place, no data value will fall on a boundary. The next two examples go into detail about how to construct a histogram using continuous data and how to create a histogram using discrete data.
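The starting-point convention described above can be expressed as a small Python sketch. The helper name convenient_start and its two parameters are assumptions made for this illustration: the smallest data value (passed as text so it is stored exactly) and the largest number of decimal places found in the data.

```python
from decimal import Decimal


def convenient_start(smallest, most_decimals):
    """Subtract half a unit of the next decimal place from the smallest value.

    smallest is passed as a string so that Decimal stores it exactly.
    """
    offset = Decimal(5) / Decimal(10) ** (most_decimals + 1)
    return Decimal(smallest) - offset


# The four cases worked out in the text.
print(convenient_start("6.1", 1))  # 6.05
print(convenient_start("1.5", 2))  # 1.495
print(convenient_start("1.0", 3))  # 0.9995
print(convenient_start("2", 0))    # 1.5
```

Because the starting point carries one more decimal place than any data value, boundaries built from it in whole-unit steps cannot coincide with a data value.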

Example 2.7

The following data are the heights (in inches to the nearest half inch) of 100 male semiprofessional soccer players. The heights are continuous data, since height is measured.

60; 60.5; 61; 61; 61.5

63.5; 63.5; 63.5

64; 64; 64; 64; 64; 64; 64; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5

66; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5

68; 68; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69.5; 69.5; 69.5; 69.5; 69.5

70; 70; 70; 70; 70; 70; 70.5; 70.5; 70.5; 71; 71; 71

72; 72; 72; 72.5; 72.5; 73; 73.5

74

The smallest data value is 60. Since the data with the most decimal places has one decimal (for instance, 61.5), we want our starting point to have two decimal places. Since the numbers 0.5, 0.05, 0.005, etc. are convenient numbers, use 0.05 and subtract it from 60, the smallest value, for the convenient starting point.

60 – 0.05 = 59.95 which is more precise than, say, 61.5 by one decimal place. The starting point is, then, 59.95.

The largest value is 74, so 74 + 0.05 = 74.05 is the ending value.

Next, calculate the width of each bar or class interval. To calculate this width, subtract the starting point from the ending value and divide by the number of bars (you must choose the number of bars you desire). Suppose you choose eight bars.

(74.05 − 59.95) / 8 = 1.7625 ≈ 1.76

NOTE

We will round up to two and make each bar or class interval two units wide. Rounding up to two is one way to prevent a value from falling on a boundary. Rounding to the next number is often necessary even if it goes against the standard rules of rounding. For this example, using 1.76 as the width would also work. A guideline that is followed by some for the number of bars or class intervals is to take the square root of the number of data values and then round to the nearest whole number, if necessary. For example, if there are 150 values of data, take the square root of 150 and round to 12 bars or intervals.
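The width calculation and the square-root guideline from the note can be checked with a short sketch. The function names suggested_bar_count and class_width are hypothetical and chosen only for this illustration.

```python
import math


def suggested_bar_count(n_values: int) -> int:
    # Square-root guideline: take sqrt(n) and round to the nearest whole number.
    return round(math.sqrt(n_values))


def class_width(start: float, end: float, bars: int) -> float:
    # Width of each bar before any rounding up.
    return (end - start) / bars


width = class_width(59.95, 74.05, 8)
rounded_width = math.ceil(width)  # round up to the next whole number

print(round(width, 2))           # 1.76
print(rounded_width)             # 2
print(suggested_bar_count(150))  # 12
```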

The boundaries are:

• 59.95

• 59.95 + 2 = 61.95

• 61.95 + 2 = 63.95

• 63.95 + 2 = 65.95

• 65.95 + 2 = 67.95

• 67.95 + 2 = 69.95

• 69.95 + 2 = 71.95

• 71.95 + 2 = 73.95

• 73.95 + 2 = 75.95

The heights 60 through 61.5 inches are in the interval 59.95–61.95. The heights that are 63.5 are in the interval 61.95–63.95. The heights that are 64 through 64.5 are in the interval 63.95–65.95. The heights 66 through 67.5 are in the interval 65.95–67.95. The heights 68 through 69.5 are in the interval 67.95–69.95. The heights 70 through 71 are in the interval 69.95–71.95. The heights 72 through 73.5 are in the interval 71.95–73.95. The height 74 is in the interval 73.95–75.95.
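As a cross-check on the boundaries and the interval assignments above, the following sketch tallies the 100 heights into the eight classes. The dictionary height_counts simply restates the data list of this example as value-to-count pairs; the variable names are illustrative.

```python
from collections import Counter

# Heights from Example 2.7, written as {height: number of players}.
height_counts = {
    60: 1, 60.5: 1, 61: 2, 61.5: 1,
    63.5: 3,
    64: 7, 64.5: 8,
    66: 10, 66.5: 11, 67: 12, 67.5: 7,
    68: 2, 69: 10, 69.5: 5,
    70: 6, 70.5: 3, 71: 3,
    72: 3, 72.5: 2, 73: 1, 73.5: 1,
    74: 1,
}

start, width, bars = 59.95, 2, 8
boundaries = [round(start + k * width, 2) for k in range(bars + 1)]

frequency = Counter()
for height, count in height_counts.items():
    # No height falls on a boundary, so floor division picks the class safely.
    index = int((height - start) // width)
    frequency[index] += count

total = sum(height_counts.values())
for index in range(bars):
    low, high = boundaries[index], boundaries[index + 1]
    relative = frequency[index] / total
    print(f"{low}-{high}: frequency {frequency[index]}, relative {relative:.2f}")
```

The relative frequencies printed here are the bar heights that a relative frequency histogram of these data would show.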

The following histogram displays the heights on the x-axis and relative frequency on the y-axis.


This OpenStax book is available for free at http://cnx.org/content/col11562/1.18

 

 

Figure 2.5

2.7 The following data are the shoe sizes of 50 male students. The sizes are continuous data since shoe size is measured. Construct a histogram and calculate the width of each bar or class interval. Suppose you choose six bars. 9; 9; 9.5; 9.5; 10; 10; 10; 10; 10; 10; 10.5; 10.5; 10.5; 10.5; 10.5; 10.5; 10.5; 10.5 11; 11; 11; 11; 11; 11; 11; 11; 11; 11; 11; 11; 11; 11.5; 11.5; 11.5; 11.5; 11.5; 11.5; 11.5 12; 12; 12; 12; 12; 12; 12; 12.5; 12.5; 12.5; 12.5; 14

Example 2.8

Create a histogram for the following data: the number of books bought by 50 part-time college students at ABC College. The number of books is discrete data, since books are counted. 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1 2; 2; 2; 2; 2; 2; 2; 2; 2; 2 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3 4; 4; 4; 4; 4; 4 5; 5; 5; 5; 5 6; 6

Eleven students buy one book. Ten students buy two books. Sixteen students buy three books. Six students buy four books. Five students buy five books. Two students buy six books.

Because the data are integers, subtract 0.5 from 1, the smallest data value and add 0.5 to 6, the largest data value. Then the starting point is 0.5 and the ending value is 6.5.

Next, calculate the width of each bar or class interval. If the data are discrete and there are not too many different values, a width that places the data values in the middle of the bar or class interval is the most convenient. Since the data consist of the numbers 1, 2, 3, 4, 5, 6, and the starting point is 0.5, a width of one places the 1 in the middle of the interval from 0.5 to 1.5, the 2 in the middle of the interval from 1.5 to 2.5, the 3 in the middle of the interval from 2.5 to 3.5, the 4 in the middle of the interval from _______ to _______, the 5 in the middle of the interval from _______ to _______, and the _______ in the middle of the interval from _______ to _______ .


Solution 2.8

• 3.5 to 4.5

• 4.5 to 5.5

• 6

• 5.5 to 6.5

Calculate the number of bars as follows:

$\frac{6.5 - 0.5}{\text{number of bars}} = 1$

where 1 is the width of a bar. Therefore, bars = 6.
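For discrete data like these, the same steps can be sketched as follows. The list books restates the frequencies given in this example; the other names are illustrative only.

```python
from collections import Counter

# Number of books bought by the 50 students in Example 2.8.
books = [1] * 11 + [2] * 10 + [3] * 16 + [4] * 6 + [5] * 5 + [6] * 2

frequency = Counter(books)

# Integer data: step half a unit outside the smallest and largest values.
start = min(books) - 0.5
end = max(books) + 0.5
width = 1
bars = int((end - start) / width)

# Each interval is centered on one of the data values 1 through 6.
intervals = [(start + k * width, start + (k + 1) * width) for k in range(bars)]

for low, high in intervals:
    value = int((low + high) / 2)
    print(f"{low} to {high}: {frequency[value]} students bought {value} book(s)")
```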

The following histogram displays the number of books on the x-axis and the frequency on the y-axis.

Figure 2.6

Go to Appendix G. There are calculator instructions for entering data and for creating a customized histogram. Create the histogram for Example 2.8.

• Press Y=. Press CLEAR to delete any equations.

• Press STAT 1:EDIT. If L1 has data in it, arrow up into the name L1, press CLEAR and then arrow down. If necessary, do the same for L2.

• Into L1, enter 1, 2, 3, 4, 5, 6.

• Into L2, enter 11, 10, 16, 6, 5, 2.

• Press WINDOW. Set Xmin = .5, Xmax = 6.5, Xscl = (6.5 – .5)/6, Ymin = –1, Ymax = 20, Yscl = 1, Xres = 1.

• Press 2nd Y=. Start by pressing 4:Plotsoff ENTER.

• Press 2nd Y=. Press 1:Plot1. Press ENTER. Arrow down to TYPE. Arrow to the 3rd picture (histogram). Press ENTER.

• Arrow down to Xlist: Enter L1 (2nd 1). Arrow down to Freq. Enter L2 (2nd 2).

• Press GRAPH.

• Use the TRACE key and the arrow keys to examine the histogram.


2.8 The following data are the number of sports played by 50 student athletes. The number of sports is discrete data since sports are counted.

1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2 3; 3; 3; 3; 3; 3; 3; 3

Twenty student athletes play one sport. Twenty-two student athletes play two sports. Eight student athletes play three sports.

Fill in the blanks for the following sentence. Since the data consist of the numbers 1, 2, 3, and the starting point is 0.5, a width of one places the 1 in the middle of the interval 0.5 to _____, the 2 in the middle of the interval from _____ to _____, and the 3 in the middle of the interval from _____ to _____.

Example 2.9

Using this data set, construct a histogram.

Number of Hours My Classmates Spent Playing Video Games on Weekends

9.95 10 2.25 16.75 0

19.5 22.5 7.5 15 12.75

5.5 11 10 20.75 17.5

23 21.9 24 23.75 18

20 15 22.9 18.8 20.5

Table 2.13


Solution 2.9

Figure 2.7

Some values in this data set fall on boundaries for the class intervals. A value is counted in a class interval if it falls on the left boundary, but not if it falls on the right boundary. Different researchers may set up histograms for the same data in different ways. There is more than one correct way to set up a histogram.
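The left-boundary rule can be sketched with the data from Table 2.13. Because Figure 2.7 is not reproduced here, the class boundaries 0, 5, 10, 15, 20, 25 used below are an assumption made for illustration; another researcher could reasonably choose different ones.

```python
from bisect import bisect_right

# Hours from Table 2.13, read row by row.
hours = [
    9.95, 10, 2.25, 16.75, 0,
    19.5, 22.5, 7.5, 15, 12.75,
    5.5, 11, 10, 20.75, 17.5,
    23, 21.9, 24, 23.75, 18,
    20, 15, 22.9, 18.8, 20.5,
]

# Assumed class boundaries (not taken from the figure).
edges = [0, 5, 10, 15, 20, 25]

counts = [0] * (len(edges) - 1)
for value in hours:
    if not edges[0] <= value < edges[-1]:
        raise ValueError(f"{value} lies outside the assumed classes")
    # bisect_right places a value that equals a boundary in the class that
    # starts there, so it counts on the left boundary, not on the right one.
    index = bisect_right(edges, value) - 1
    counts[index] += 1

for low, high, count in zip(edges, edges[1:], counts):
    print(f"{low} up to but not including {high}: {count}")
```

With these assumed boundaries, the values 10, 15, and 20 each sit on a boundary and are counted in the class that begins there.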

2.9 The following data represent the number of employees at various restaurants in New York City. Using this data, create a histogram.

22; 35; 15; 26; 40; 28; 18; 20; 25; 34; 39; 42; 24; 22; 19; 27; 22; 34; 40; 20; 38; 28

Use 10–19 as the first interval.

Count the money (bills and change) in your pocket or purse. Your instructor will record the amounts. As a class, construct a histogram displaying the data. Discuss how many intervals you think is appropriate. You may want to experiment with the number of intervals.

Frequency Polygons

Frequency polygons are analogous to line graphs, and just as line graphs make continuous data visually easy to interpret, so too do frequency polygons.

To construct a frequency polygon, first examine the data and decide on the number of intervals, or class intervals, to use on the x-axis and y-axis. After choosing the appropriate ranges, begin plotting the data points. After all the points are plotted, draw line segments to connect them.

Example 2.10

A frequency polygon was constructed from the frequency table below.

Frequency Distribution for Calculus Final Test Scores

Lower Bound Upper Bound Frequency Cumulative Frequency

49.5 59.5 5 5

59.5 69.5 10 15

69.5 79.5 30 45

79.5 89.5 40 85

89.5 99.5 15 100

Table 2.14

Figure 2.8

The first label on the x-axis is 44.5. This represents an interval extending from 39.5 to 49.5. Since the lowest class in the table begins at 49.5, this interval contains no data and is used only to allow the graph to touch the x-axis. The point labeled 54.5 represents the next interval, or the first “real” interval from the table, and contains five scores. This reasoning is followed for each of the remaining intervals, with the point 104.5 representing the interval from 99.5 to 109.5. Again, this interval contains no data and is used only so that the graph will touch the x-axis. Looking at the graph, we say that this distribution is skewed because one side of the graph does not mirror the other side.
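The points of this frequency polygon can be generated from Table 2.14 with a short sketch. Each class is represented by its midpoint, and one empty class is added on each side so that the polygon touches the x-axis. The variable names are illustrative.

```python
# (lower bound, upper bound, frequency) for each class in Table 2.14.
classes = [
    (49.5, 59.5, 5),
    (59.5, 69.5, 10),
    (69.5, 79.5, 30),
    (79.5, 89.5, 40),
    (89.5, 99.5, 15),
]

width = classes[0][1] - classes[0][0]

# One point per class: (midpoint, frequency).
points = [((low + high) / 2, freq) for low, high, freq in classes]

# Empty classes at both ends bring the polygon down to the x-axis.
first_midpoint = points[0][0]
last_midpoint = points[-1][0]
polygon = [(first_midpoint - width, 0)] + points + [(last_midpoint + width, 0)]

for midpoint, freq in polygon:
    print(midpoint, freq)
```

Connecting the printed points in order with line segments gives the polygon described above, from 44.5 on the left to 104.5 on the right.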

2.10 Construct a frequency polygon of U.S. Presidents’ ages at inauguration shown in Table 2.15.


Age at Inauguration Frequency

41.5–46.5 4

46.5–51.5 11

51.5–56.5 14

56.5–61.5 9

61.5–66.5 4

66.5–71.5 2

Table 2.15

Frequency polygons are useful for comparing distributions. This is achieved by overlaying the frequency polygons drawn for different data sets.

Example 2.11

We will construct an overlay frequency polygon comparing the scores from Example 2.10 with the students’ final numeric grade.

Frequency Distribution for Calculus Final Test Scores

Lower Bound Upper Bound Frequency Cumulative Frequency

49.5 59.5 5 5

59.5 69.5 10 15

69.5 79.5 30 45

79.5 89.5 40 85

89.5 99.5 15 100

Table 2.16

Frequency Distribution for Calculus Final Grades

Lower Bound Upper Bound Frequency Cumulative Frequency

49.5 59.5 10 10

59.5 69.5 10 20

69.5 79.5 30 50

79.5 89.5 45 95

89.5 99.5 5 100

Table 2.17


Figure 2.9

Suppose that we want to study the temperature range of a region for an entire month. Every day at noon we note the temperature and write this down in a log. A variety of statistical studies could be done with this data. We could find the mean or the median temperature for the month. We could construct a histogram displaying the number of days that temperatures reach a certain range of values. However, all of these methods ignore a portion of the data that we have collected.

One feature of the data that we may want to consider is that of time. Since each date is paired with the temperature reading for the day, we don't have to think of the data as being random. We can instead use the times given to impose a chronological order on the data. A graph that recognizes this ordering and displays the changing temperature as the month progresses is called a time series graph.

Constructing a Time Series Graph

To construct a time series graph, we must look at both pieces of our paired data set. We start with a standard Cartesian coordinate system. The horizontal axis is used to plot the date or time increments, and the vertical axis is used to plot the values of the variable that we are measuring. By doing this, we make each point on the graph correspond to a date and a measured quantity. The points on the graph are typically connected by straight lines in the order in which they occur.


Example 2.12

The following tables show the Consumer Price Index for each month, together with the annual value, for ten years. Construct a time series graph for the Annual Consumer Price Index data only.

Year Jan Feb Mar Apr May Jun Jul

2003 181.7 183.1 184.2 183.8 183.5 183.7 183.9

2004 185.2 186.2 187.4 188.0 189.1 189.7 189.4

2005 190.7 191.8 193.3 194.6 194.4 194.5 195.4

2006 198.3 198.7 199.8 201.5 202.5 202.9 203.5

2007 202.416 203.499 205.352 206.686 207.949 208.352 208.299

2008 211.080 211.693 213.528 214.823 216.632 218.815 219.964

2009 211.143 212.193 212.709 213.240 213.856 215.693 215.351

2010 216.687 216.741 217.631 218.009 218.178 217.965 218.011

2011 220.223 221.309 223.467 224.906 225.964 225.722 225.922

2012 226.665 227.663 229.392 230.085 229.815 229.478 229.104

Table 2.18

Year Aug Sep Oct Nov Dec Annual

2003 184.6 185.2 185.0 184.5 184.3 184.0

2004 189.5 189.9 190.9 191.0 190.3 188.9

2005 196.4 198.8 199.2 197.6 196.8 195.3

2006 203.9 202.9 201.8 201.5 201.8 201.6

2007 207.917 208.490 208.936 210.177 210.036 207.342

2008 219.086 218.783 216.573 212.425 210.228 215.303

2009 215.834 215.969 216.177 216.330 215.949 214.537

2010 218.312 218.439 218.711 218.803 219.179 218.056

2011 226.545 226.889 226.421 226.230 225.672 224.939

2012 230.379 231.407 231.317 230.221 229.601 229.594

Table 2.19


Solution 2.12

Figure 2.10
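The construction of the time series graph for this example can be sketched without any plotting library. The code below pairs each year with the Annual value from Table 2.19, puts the pairs in chronological order, and lists the line segments that would connect consecutive points. The variable names are illustrative.

```python
# Annual Consumer Price Index values from Table 2.19.
annual_cpi = {
    2003: 184.0, 2004: 188.9, 2005: 195.3, 2006: 201.6, 2007: 207.342,
    2008: 215.303, 2009: 214.537, 2010: 218.056, 2011: 224.939, 2012: 229.594,
}

# Horizontal axis: year. Vertical axis: annual CPI. Order the points by time.
points = sorted(annual_cpi.items())

# Consecutive points are joined by straight line segments.
segments = list(zip(points, points[1:]))

# Years in which the line falls compared with the previous year.
falling = [later[0] for earlier, later in segments if later[1] < earlier[1]]

for year, value in points:
    print(year, value)
print("Years with a decrease:", falling)
```

Passing the years and values in points to any graphing tool, and connecting them in order, produces the time series graph. The list falling shows that the annual index dropped only in 2009.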

2.12 The following table is a portion of a data set from www.worldbank.org. Use the table to construct a time series graph for CO2 emissions for the United States.

CO2 Emissions

Year Ukraine United Kingdom United States

2003 352,259 540,640 5,681,664

2004 343,121 540,409 5,790,761

2005 339,029 541,990 5,826,394

2006 327,797 542,045 5,737,615

2007 328,357 528,631 5,828,697

2008 323,657 522,247 5,656,839

2009 272,176 474,579 5,299,563

Table 2.20

Uses of a Time Series Graph

Time series graphs are important tools in various applications of statistics. When recording values of the same variable over an extended period of time, sometimes it is difficult to discern any trend or pattern. However, once the same data points are displayed graphically, some features jump out. Time series graphs make trends easy to spot.

2.3 | Measures of the Location of the Data

The common measures of location are quartiles and percentiles.

Quartiles are special percentiles. The first quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile. The median, M, is called both the second quartile and the 50th percentile.

To calculate quartiles and percentiles, the data must be ordered from smallest to largest. Quartiles divide ordered data into quarters. Percentiles divide ordered data into hundredths. To score in the 90th percentile of an exam does not mean, necessarily, that you received 90% on a test. It means that 90% of test scores are the same as or less than your score and 10% of the test scores are the same as or greater than your test score.


Percentiles are useful for comparing values. For this reason, universities and colleges use percentiles extensively. One instance in which colleges and universities use percentiles is when SAT results are used to determine a minimum testing score that will be used as an acceptance factor. For example, suppose Duke accepts SAT scores at or above the 75th percentile. That translates into a score of at least 1220.

Percentiles are mostly used with very large populations. Therefore, if you were to say that 90% of the test scores are less (and not the same or less) than your score, it would be acceptable because removing one particular data value is not significant.

The median is a number that measures the “center” of the data. You can think of the median as the “middle value,” but it does not actually have to be one of the observed values. It is a number that separates ordered data into halves. Half the values are the same number or smaller than the median, and half the values are the same number or larger. For example, consider the following data. 1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1 Ordered from smallest to largest: 1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5

Since there are 14 observations, the median is between the seventh value, 6.8, and the eighth value, 7.2. To find the median, add the two values together and divide by two.

$\frac{6.8 + 7.2}{2} = 7$

The median is seven. Half of the values are smaller than seven and half of the values are larger than seven.
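The median calculation above can be written as a small function. This is a sketch of the procedure just described: order the data, then take the middle value, or the average of the two middle values when the number of observations is even.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of observations: the middle value is the median.
        return ordered[middle]
    # Even number of observations: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


data = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]
print(median(data))  # 7.0
```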

Quartiles are numbers that separate the data into quarters. Quartiles may or may not be part of the data. To find the quartiles, first find the median or second quartile. The first quartile, Q1, is the middle value of the lower half of the data, and the third quartile, Q3, is the middle value, or median, of the upper half of the data. To get the idea, consider the same data set: 1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5

The median or second quartile is seven. The lower half of the data are 1, 1, 2, 2, 4, 6, 6.8. The middle value of the lower half is two. 1; 1; 2; 2; 4; 6; 6.8

The number two, which is part of the data, is the first quartile. One-fourth of the entire set of values are the same as or less than two and three-fourths of the values are more than two.

The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of the upper half is nine.

The third quartile, Q3, is nine. Three-fourths (75%) of the ordered data set are less than nine. One-fourth (25%) of the ordered data set are greater than nine. The third quartile is part of the data set in this example.
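The quartile procedure used here (find the median, then take the median of each half) can be sketched as follows. The function name quartiles is hypothetical. For an odd number of values the sketch leaves the median itself out of both halves, which matches how the halves are formed in the worked examples of this section; other textbooks and software may use different conventions.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, skip the middle value when forming the upper half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


data = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)  # 2 7.0 9
```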

The interquartile range is a number that indicates the spread of the middle half or the middle 50% of the data. It is the difference between the third quartile (Q3) and the first quartile (Q1).

IQR = Q3 – Q1

The IQR can help to determine potential outliers. A value is suspected to be a potential outlier if it is more than (1.5)(IQR) below the first quartile or more than (1.5)(IQR) above the third quartile. Potential outliers always require further investigation.

NOTE

A potential outlier is a data point that is significantly different from the other data points. These special data points may be errors or some kind of abnormality or they may be a key to understanding the data.

Example 2.13

For the following 13 real estate prices, calculate the IQR and determine if any prices are potential outliers. Prices are in dollars. 389,950; 230,500; 158,000; 479,000; 639,000; 114,950; 5,500,000; 387,000; 659,000; 529,000; 575,000; 488,800; 1,095,000


Solution 2.13

Order the data from smallest to largest. 114,950; 158,000; 230,500; 387,000; 389,950; 479,000; 488,800; 529,000; 575,000; 639,000; 659,000; 1,095,000; 5,500,000

M = 488,800

Q1 = $\frac{230{,}500 + 387{,}000}{2}$ = 308,750

Q3 = $\frac{639{,}000 + 659{,}000}{2}$ = 649,000

IQR = 649,000 – 308,750 = 340,250

(1.5)(IQR) = (1.5)(340,250) = 510,375

Q1 – (1.5)(IQR) = 308,750 – 510,375 = –201,625

Q3 + (1.5)(IQR) = 649,000 + 510,375 = 1,159,375

No house price is less than –201,625. However, 5,500,000 is more than 1,159,375. Therefore, 5,500,000 is a potential outlier.
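The outlier check in this solution can be sketched in code. The function name outlier_fences is hypothetical; it takes the quartiles already computed above and returns the two cutoff values.

```python
def outlier_fences(q1, q3):
    """Return the lower and upper cutoffs of the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


prices = [
    389_950, 230_500, 158_000, 479_000, 639_000, 114_950, 5_500_000,
    387_000, 659_000, 529_000, 575_000, 488_800, 1_095_000,
]

lower, upper = outlier_fences(308_750, 649_000)
flagged = [price for price in prices if price < lower or price > upper]

print(lower, upper)  # -201625.0 1159375.0
print(flagged)       # [5500000]
```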

2.13 For the following 11 salaries, calculate the IQR and determine if any salaries are outliers. The salaries are in dollars.

$33,000; $64,500; $28,000; $54,000; $72,000; $68,500; $69,000; $42,000; $54,000; $120,000; $40,500

Example 2.14

For the two data sets in the test scores example, find the following:

a. The interquartile range. Compare the two interquartile ranges.

b. Any outliers in either set.

Solution 2.14

The five number summary for the day and night classes is

Minimum Q1 Median Q3 Maximum

Day 32 56 74.5 82.5 99

Night 25.5 78 81 89 98

Table 2.21

a. The IQR for the day group is Q3 – Q1 = 82.5 – 56 = 26.5.

The IQR for the night group is Q3 – Q1 = 89 – 78 = 11.

The interquartile range (the spread or variability) for the day class is larger than the night class IQR. This suggests more variation will be found in the day class’s test scores.

b. Day class outliers are found using the IQR times 1.5 rule. So,

Q1 – IQR(1.5) = 56 – 26.5(1.5) = 16.25

Q3 + IQR(1.5) = 82.5 + 26.5(1.5) = 122.25

Since the minimum and maximum values for the day class are greater than 16.25 and less than 122.25, there are no outliers.

Night class outliers are calculated as:

Q1 – IQR(1.5) = 78 – 11(1.5) = 61.5

Q3 + IQR(1.5) = 89 + 11(1.5) = 105.5

For this class, any test score less than 61.5 is an outlier. Therefore, the scores of 45 and 25.5 are outliers. Since no test score is greater than 105.5, there is no upper end outlier.

2.14 Find the interquartile range for the following two data sets and compare them. Test Scores for Class A 69; 96; 81; 79; 65; 76; 83; 99; 89; 67; 90; 77; 85; 98; 66; 91; 77; 69; 80; 94 Test Scores for Class B 90; 72; 80; 92; 90; 97; 92; 75; 79; 68; 70; 80; 99; 95; 78; 73; 71; 68; 95; 100

Example 2.15

Fifty statistics students were asked how much sleep they get per school night (rounded to the nearest hour). The results were:

AMOUNT OF SLEEP PER SCHOOL NIGHT (HOURS) FREQUENCY RELATIVE FREQUENCY CUMULATIVE RELATIVE FREQUENCY

4 2 0.04 0.04

5 5 0.10 0.14

6 7 0.14 0.28

7 12 0.24 0.52

8 14 0.28 0.80

9 7 0.14 0.94

10 3 0.06 1.00

Table 2.22

Find the 28th percentile. Notice the 0.28 in the “cumulative relative frequency” column. Twenty-eight percent of 50 data values is 14 values. There are 14 values less than the 28th percentile. They include the two 4s, the five 5s, and the seven 6s. The 28th percentile is between the last six and the first seven. The 28th percentile is 6.5.

Find the median. Look again at the “cumulative relative frequency” column and find 0.52. The median is the 50th percentile or the second quartile. 50% of 50 is 25. There are 25 values less than the median. They include the two 4s, the five 5s, the seven 6s, and eleven of the 7s. The median or 50th percentile is between the 25th, or seven, and 26th, or seven, values. The median is seven.

Find the third quartile. The third quartile is the same as the 75th percentile. You can “eyeball” this answer. If you look at the “cumulative relative frequency” column, you find 0.52 and 0.80. When you have all the fours, fives, sixes, and sevens, you have 52% of the data. When you include all the 8s, you have 80% of the data. The 75th percentile, then, must be an eight. Another way to look at the problem is to find 75% of 50, which is 37.5, and round up to 38. The third quartile, Q3, is the 38th value, which is an eight. You can check this answer by counting the values. (There are 37 values below the third quartile and 12 values above.)

2.15 Forty bus drivers were asked how many hours they spend each day running their routes (rounded to the nearest hour). Find the 65th percentile.

Amount of time spent on route (hours) Frequency Relative Frequency Cumulative Relative Frequency

2 12 0.30 0.30

3 14 0.35 0.65

4 10 0.25 0.90

5 4 0.10 1.00

Table 2.23

Example 2.16

Using Table 2.22:

a. Find the 80th percentile.

b. Find the 90th percentile.

c. Find the first quartile. What is another name for the first quartile?

Solution 2.16

Using the data from the frequency table, we have:

a. The 80th percentile is between the last eight and the first nine in the table (between the 40th and 41st values). Therefore, we need to take the mean of the 40th and 41st values. The 80th percentile = $\frac{8 + 9}{2} = 8.5$

b. The 90th percentile will be the 45th data value (location is 0.90(50) = 45) and the 45th data value is nine.

c. Q1 is also the 25th percentile. The 25th percentile location calculation: P25 = 0.25(50) = 12.5 ≈ 13 the 13th data value. Thus, the 25th percentile is six.
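The percentile answers in Examples 2.15 and 2.16 can be reproduced with a short sketch. The rule coded below is inferred from those worked answers and is an assumption, not a formula stated in the text: compute k percent of n; if the result is a whole number, average the value in that position and the next one; otherwise round the position up. The function name percentile_from_table is hypothetical.

```python
def percentile_from_table(table, k):
    """Percentile from (value, frequency) rows, for a whole number 0 < k < 100."""
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    # Expand the frequency table into the ordered list of data values.
    ordered = [value for value, freq in sorted(table) for _ in range(freq)]
    n = len(ordered)
    position, remainder = divmod(k * n, 100)
    if remainder == 0:
        # Exactly `position` values lie below: average that value and the next.
        return (ordered[position - 1] + ordered[position]) / 2
    # Otherwise round the position up (1-based position + 1 is index position).
    return ordered[position]


# Table 2.22: hours of sleep and frequency.
sleep = [(4, 2), (5, 5), (6, 7), (7, 12), (8, 14), (9, 7), (10, 3)]

for k in (28, 50, 75, 80, 90, 25):
    print(k, percentile_from_table(sleep, k))
```

For Table 2.22 this gives 6.5, 7, 8, 8.5, 9, and 6 for the 28th, 50th, 75th, 80th, 90th, and 25th percentiles, in agreement with the two examples.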

2.16 Refer to Table 2.23. Find the third quartile. What is another name for the third quartile?


Your instructor or a member of the class will ask everyone in class how many sweaters they own. Answer the following questions:

1. How many students were surveyed?

2. What kind of sampling did you do?

3. Construct two different histograms. For each, starting value = _____ ending value = ____.

4. Find the median, first quartile, and third quartile.

5. Construct a table of the data to find the following:

a. the 10th percentile

b. the 70th percentile

c. the percent of students who own less than four sweaters

A Formula for Finding the kth Percentile

If you were to do a little research, you would find several formulas for calculating the kth percentile. Here is one of them.

k = the kth percentile. It may or may not be part of the data.

i = the index (ranking or position of a data value)

n = the total number of data

• Order the data from smallest to largest.

• Calculate $i = \frac{k}{100}(n + 1)$

• If i is an integer, then the kth percentile is the data value in the ith position in the ordered set of data.

• If i is not an integer, then round i up and round i down to the nearest integers. Average the two data values in these two positions in the ordered data set. This is easier to understand in an example.
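The steps above can be sketched as a function. The name kth_percentile is hypothetical, k is assumed to be a whole number, and the sketch raises an error when the index falls outside the data, a case the steps above do not address. The ages are the 29 values used in Example 2.17.

```python
def kth_percentile(ordered, k):
    """kth percentile of data already ordered from smallest to largest."""
    n = len(ordered)
    # i = (k / 100)(n + 1), kept as whole part and remainder to avoid
    # floating-point error when testing whether i is an integer.
    whole, remainder = divmod(k * (n + 1), 100)
    if remainder == 0:
        if not 1 <= whole <= n:
            raise ValueError("index falls outside the data")
        return ordered[whole - 1]
    if not 1 <= whole < n:
        raise ValueError("index falls outside the data")
    # i is not an integer: average the values in the positions below and above.
    return (ordered[whole - 1] + ordered[whole]) / 2


ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

print(kth_percentile(ages, 70))  # 64
print(kth_percentile(ages, 83))  # 71.5
```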

Example 2.17

Listed are 29 ages for Academy Award winning best actors in order from smallest to largest. 18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77

a. Find the 70th percentile.

b. Find the 83rd percentile.

Solution 2.17

a. k = 70

i = the index

n = 29

$i = \frac{k}{100}(n + 1) = \left(\frac{70}{100}\right)(29 + 1) = 21$. Twenty-one is an integer, and the data value in the 21st position in the ordered data set is 64. The 70th percentile is 64 years.

b. k = 83

i = the index

n = 29

$i = \frac{k}{100}(n + 1) = \left(\frac{83}{100}\right)(29 + 1) = 24.9$, which is NOT an integer. Round it down to 24 and up to 25. The age in the 24th position is 71 and the age in the 25th position is 72. Average 71 and 72. The 83rd percentile is 71.5 years.

2.17 Listed are 29 ages for Academy Award winning best actors in order from smallest to largest. 18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77 Calculate the 20th percentile and the 55th percentile.

NOTE

You can calculate percentiles using calculators and computers. There are a variety of online calculators.

A Formula for Finding the Percentile of a Value in a Data Set

• Order the data from smallest to largest.

• x = the number of data values counting from the bottom of the data list up to but not including the data value for which you want to find the percentile.

• y = the number of data values equal to the data value for which you want to find the percentile.

• n = the total number of data.

• Calculate $\frac{x + 0.5y}{n}(100)$. Then round to the nearest integer.
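The steps above can be sketched as a function. The name percentile_of_value is hypothetical, and rounding a result ending in exactly .5 upward is an assumption, since the steps do not say how to break that tie. The ages are the 29 values used in Example 2.18.

```python
import math


def percentile_of_value(ordered, value):
    """Percentile rank of `value` in data ordered from smallest to largest."""
    n = len(ordered)
    x = sum(1 for item in ordered if item < value)   # values below `value`
    y = sum(1 for item in ordered if item == value)  # values equal to `value`
    raw = (x + 0.5 * y) / n * 100
    # Round to the nearest integer (ties rounded up, by assumption).
    return math.floor(raw + 0.5)


ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

print(percentile_of_value(ages, 58))  # 64
print(percentile_of_value(ages, 25))  # 12
```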

Example 2.18

Listed are 29 ages for Academy Award winning best actors in order from smallest to largest. 18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77

a. Find the percentile for 58.

b. Find the percentile for 25.

Solution 2.18

a. Counting from the bottom of the list, there are 18 data values less than 58. There is one value of 58.

x = 18 and y = 1. $\frac{x + 0.5y}{n}(100) = \frac{18 + 0.5(1)}{29}(100) \approx 63.79$. 58 is the 64th percentile.

b. Counting from the bottom of the list, there are three data values less than 25. There is one value of 25.

x = 3 and y = 1. $\frac{x + 0.5y}{n}(100) = \frac{3 + 0.5(1)}{29}(100) \approx 12.07$. Twenty-five is the 12th percentile.


2.18 Listed are 30 ages for Academy Award winning best actors in order from smallest to largest. 18; 21; 22; 25; 26; 27; 29; 30; 31; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77 Find the percentiles for 47 and 31.

Interpreting Percentiles, Quartiles, and Median

A percentile indicates the relative standing of a data value when data are sorted into numerical order from smallest to largest. In general, p percent of data values are less than or equal to the pth percentile. For example, 15% of data values are less than or equal to the 15th percentile.

• Low percentiles always correspond to lower data values.

• High percentiles always correspond to higher data values.

A percentile may or may not correspond to a value judgment about whether it is “good” or “bad.” The interpretation of whether a certain percentile is “good” or “bad” depends on the context of the situation to which the data applies. In some situations, a low percentile would be considered “good;” in other contexts a high percentile might be considered “good”. In many situations, there is no value judgment that applies.

Understanding how to interpret percentiles properly is important not only when describing data, but also when calculating probabilities in later chapters of this text.

NOTE

When writing the interpretation of a percentile in the context of the given data, the sentence should contain the following information.

• information about the context of the situation being considered

• the data value (value of the variable) that represents the percentile

• the percent of individuals or items with data values below the percentile

• the percent of individuals or items with data values above the percentile.

Example 2.19

On a timed math test, the first quartile for time it took to finish the exam was 35 minutes. Interpret the first quartile in the context of this situation.

Solution 2.19

• Twenty-five percent of students finished the exam in 35 minutes or less.

• Seventy-five percent of students finished the exam in 35 minutes or more.

• A low percentile could be considered good, as finishing more quickly on a timed exam is desirable. (If you take too long, you might not be able to finish.)

2.19 For the 100-meter dash, the third quartile for times for finishing the race was 11.5 seconds. Interpret the third quartile in the context of the situation.


Example 2.20

On a 20 question math test, the 70th percentile for number of correct answers was 16. Interpret the 70th percentile in the context of this situation.

Solution 2.20

• Seventy percent of students answered 16 or fewer questions correctly.

• Thirty percent of students answered 16 or more questions correctly.

• A higher percentile could be considered good, as answering more questions correctly is desirable.

2.20 On a 60 point written assignment, the 80th percentile for the number of points earned was 49. Interpret the 80th percentile in the context of this situation.

Example 2.21

At a community college, it was found that the 30th percentile of credit units that students are enrolled for is seven units. Interpret the 30th percentile in the context of this situation.

Solution 2.21

• Thirty percent of students are enrolled in seven or fewer credit units.

• Seventy percent of students are enrolled in seven or more credit units.

• In this example, there is no “good” or “bad” value judgment associated with a higher or lower percentile. Students attend community college for varied reasons and needs, and their course load varies according to their needs.

2.21 During a season, the 40th percentile for points scored per player in a game is eight. Interpret the 40th percentile in the context of this situation.

Example 2.22

Sharpe Middle School is applying for a grant that will be used to add fitness equipment to the gym. The principal surveyed 15 anonymous students to determine how many minutes a day the students spend exercising. The results from the 15 anonymous students are shown.

0 minutes; 40 minutes; 60 minutes; 30 minutes; 60 minutes

10 minutes; 45 minutes; 30 minutes; 300 minutes; 90 minutes;

30 minutes; 120 minutes; 60 minutes; 0 minutes; 20 minutes

Determine the following five values.

Min = 0 Q1 = 20 Med = 40 Q3 = 60 Max = 300

If you were the principal, would you be justified in purchasing new fitness equipment? Since 75% of the students exercise for 60 minutes or less daily, and since the IQR is 40 minutes (60 – 20 = 40), we know that half of the students surveyed exercise between 20 minutes and 60 minutes daily. This seems a reasonable amount of time spent exercising, so the principal would be justified in purchasing the new equipment.

However, the principal needs to be careful. The value 300 appears to be a potential outlier.

Q3 + 1.5(IQR) = 60 + (1.5)(40) = 120.

The value 300 is greater than 120 so it is a potential outlier. If we delete it and calculate the five values, we get the following values:

Min = 0 Q1 = 20 Q3 = 60 Max = 120

We still have 75% of the students exercising for 60 minutes or less daily and half of the students exercising between 20 and 60 minutes a day. However, 15 students is a small sample and the principal should survey more students to be sure of his survey results.
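The five values in this example, with and without the potential outlier, can be checked with a short sketch. The function name five_number_summary is hypothetical, and the quartiles are found as the medians of the lower and upper halves, leaving the median itself out of both halves when the number of values is odd.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


minutes = [0, 40, 60, 30, 60, 10, 45, 30, 300, 90, 30, 120, 60, 0, 20]

with_outlier = five_number_summary(minutes)
without_outlier = five_number_summary([m for m in minutes if m != 300])

print(with_outlier)     # (0, 20, 40, 60, 300)
print(without_outlier)  # (0, 20, 35.0, 60, 120)
```

Removing 300 leaves the quartiles at 20 and 60, which is why the conclusions about the middle half of the students do not change.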

2.4 | Box Plots

Box plots (also called box-and-whisker plots or box-whisker plots) give a good graphical image of the concentration of the data. They also show how far the extreme values are from most of the data. A box plot is constructed from five values: the minimum value, the first quartile, the median, the third quartile, and the maximum value. We use these values to compare how close other data values are to them.

To construct a box plot, use a horizontal or vertical number line and a rectangular box. The smallest and largest data values label the endpoints of the axis. The first quartile marks one end of the box and the third quartile marks the other end of the box. Approximately the middle 50 percent of the data fall inside the box. The “whiskers” extend from the ends of the box to the smallest and largest data values. The median or second quartile can be between the first and third quartiles, or it can be one, or the other, or both. The box plot gives a good, quick picture of the data.

NOTE

You may encounter box-and-whisker plots that have dots marking outlier values. In those cases, the whiskers do not extend to the minimum and maximum values.

Consider, again, this dataset.

1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5

The first quartile is two, the median is seven, and the third quartile is nine. The smallest value is one, and the largest value is 11.5. The following image shows the constructed box plot.

NOTE

See the calculator instructions on the TI web site (http://education.ti.com/educationportal/sites/US/sectionHome/support.html) or in the appendix.

This OpenStax book is available for free at http://cnx.org/content/col11562/1.18

Figure 2.11

The two whiskers extend from the first quartile to the smallest value and from the third quartile to the largest value. The median is shown with a dashed line.

NOTE

It is important to start a box plot with a scaled number line. Otherwise the box plot may not be useful.

Example 2.23

The following data are the heights of 40 students in a statistics class.

59; 60; 61; 62; 62; 63; 63; 64; 64; 64; 65; 65; 65; 65; 65; 65; 65; 65; 65; 66; 66; 67; 67; 68; 68; 69; 70; 70; 70; 70; 70; 71; 71; 72; 72; 73; 74; 74; 75; 77

Construct a box plot with the following properties. The calculator instructions for the minimum and maximum values, as well as the quartiles, follow the example.

• Minimum value = 59

• Maximum value = 77

• Q1: First quartile = 64.5

• Q2: Second quartile or median = 66

• Q3: Third quartile = 70

Figure 2.12

a. Each quarter has approximately 25% of the data.

b. The spreads of the four quarters are 64.5 – 59 = 5.5 (first quarter), 66 – 64.5 = 1.5 (second quarter), 70 – 66 = 4 (third quarter), and 77 – 70 = 7 (fourth quarter). So, the second quarter has the smallest spread and the fourth quarter has the largest spread.

c. Range = maximum value – the minimum value = 77 – 59 = 18

d. Interquartile Range: IQR = Q3 – Q1 = 70 – 64.5 = 5.5.

e. The interval 59–65 contains 19 of the 40 values (47.5%), which is more than 25% of the data, so it has more data in it than the interval 66 through 70, which contains 12 of the 40 values (30%).

f. The middle 50% (middle half) of the data has a range of 5.5 inches.

To find the minimum, maximum, and quartiles:

Enter data into the list editor (Press STAT 1:EDIT). If you need to clear the list, arrow up to the name L1, press CLEAR, and then arrow down.

Put the data values into the list L1.

Press STAT and arrow to CALC. Press 1:1-VarStats. Enter L1.

Press ENTER.

Use the down and up arrow keys to scroll.

Smallest value = 59.

Largest value = 77.

Q1: First quartile = 64.5.

Q2: Second quartile or median = 66.

Q3: Third quartile = 70.

To construct the box plot:

Press 4:Plotsoff. Press ENTER.

Arrow down and then use the right arrow key to go to the fifth picture, which is the box plot. Press ENTER.

Arrow down to Xlist: Press 2nd 1 for L1

Arrow down to Freq: Press ALPHA. Press 1.

Press Zoom. Press 9: ZoomStat.

Press TRACE, and use the arrow keys to examine the box plot.
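If you are working in software instead of on a calculator, the five values can be computed with a short Python sketch. The helper name `five_number_summary` is hypothetical. Software packages use several different quartile conventions; the one assumed here (each quartile is the median of the lower or upper half of the ordered data) is chosen because it reproduces the values listed in Example 2.23.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of at least two numbers.

    Q1 and Q3 are the medians of the lower and upper halves of the
    ordered data; for an odd count the middle value is left out of
    both halves.
    """
    values = sorted(data)
    half = len(values) // 2
    lower_half = values[:half]
    upper_half = values[-half:]
    return (values[0], median(lower_half), median(values),
            median(upper_half), values[-1])


# Heights of the 40 students in Example 2.23.
heights = [59, 60, 61, 62, 62, 63, 63, 64, 64, 64,
           65, 65, 65, 65, 65, 65, 65, 65, 65, 66,
           66, 67, 67, 68, 68, 69, 70, 70, 70, 70,
           70, 71, 71, 72, 72, 73, 74, 74, 75, 77]

smallest, q1, med, q3, largest = five_number_summary(heights)
print(smallest, q1, med, q3, largest)
print("IQR =", q3 - q1)
```

Other conventions, including some built-in quantile functions, interpolate between data values and may return slightly different quartiles for the same data.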

2.23 The following data are the number of pages in 40 books on a shelf. Construct a box plot using a graphing calculator, and state the interquartile range.

136; 140; 178; 190; 205; 215; 217; 218; 232; 234; 240; 255; 270; 275; 290; 301; 303; 315; 317; 318; 326; 333; 343; 349; 360; 369; 377; 388; 391; 392; 398; 400; 402; 405; 408; 422; 429; 450; 475; 512

For some sets of data, some of the largest value, smallest value, first quartile, median, and third quartile may be the same. For instance, you might have a data set in which the median and the third quartile are the same. In this case, the diagram would not have a dotted line inside the box displaying the median. The right side of the box would display both the third quartile and the median. For example, if the smallest value and the first quartile were both one, the median and the third quartile were both five, and the largest value was seven, the box plot would look like:

Figure 2.13

In this case, at least 25% of the values are equal to one. Twenty-five percent of the values are between one and five, inclusive. At least 25% of the values are equal to five. The top 25% of the values fall between five and seven, inclusive.

Example 2.24

Test scores for a college statistics class held during the day are:

99; 56; 78; 55.5; 32; 90; 80; 81; 56; 59; 45; 77; 84.5; 84; 70; 72; 68; 32; 79; 90

Test scores for a college statistics class held during the evening are:

98; 78; 68; 83; 81; 89; 88; 76; 65; 45; 98; 90; 80; 84.5; 85; 79; 78; 98; 90; 79; 81; 25.5

a. Find the smallest and largest values, the median, and the first and third quartile for the day class.

b. Find the smallest and largest values, the median, and the first and third quartile for the night class.

c. For each data set, what percentage of the data is between the smallest value and the first quartile? the first quartile and the median? the median and the third quartile? the third quartile and the largest value? What percentage of the data is between the first quartile and the largest value?

d. Create a box plot for each set of data. Use one number line for both box plots.

e. Which box plot has the widest spread for the middle 50% of the data (the data between the first and third quartiles)? What does this mean for that set of data in comparison to the other set of data?

Solution 2.24

a. Min = 32 Q1 = 56 M = 74.5 Q3 = 82.5 Max = 99

b. Min = 25.5 Q1 = 78 M = 81 Q3 = 89 Max = 98

c. Day class: There are six data values ranging from 32 to 56: 30%. There are six data values ranging from 56 to 74.5: 30%. There are five data values ranging from 74.5 to 82.5: 25%. There are five data values ranging from 82.5 to 99: 25%. There are 16 data values between the first quartile, 56, and the largest value, 99: 75%. Night class:

d. Figure 2.14

e. The first data set has the wider spread for the middle 50% of the data. The IQR for the first data set is greater than the IQR for the second set. This means that there is more variability in the middle 50% of the first data set.
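As a quick check of part (e), the two interquartile ranges follow directly from the quartiles found in parts (a) and (b). The subscripts "day" and "night" are labels introduced here for illustration.

```latex
\mathrm{IQR}_{\text{day}} = Q_3 - Q_1 = 82.5 - 56 = 26.5
\qquad
\mathrm{IQR}_{\text{night}} = Q_3 - Q_1 = 89 - 78 = 11
```

Since 26.5 is greater than 11, the middle 50% of the day class scores is more spread out than the middle 50% of the night class scores.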

2.24 The following data set shows the heights in inches for the boys in a class of 40 students.

66; 66; 67; 67; 68; 68; 68; 68; 68; 69; 69; 69; 70; 71; 72; 72; 72; 73; 73; 74

The following data set shows the heights in inches for the girls in a class of 40 students.

61; 61; 62; 62; 63; 63; 63; 65; 65; 65; 66; 66; 66; 67; 68; 68; 68; 69; 69; 69

Construct a box plot using a graphing calculator for each data set, and state which box plot has the wider spread for the middle 50% of the data.

Example 2.25

Graph a box-and-whisker plot for the data values shown.

10; 10; 10; 15; 35; 75; 90; 95; 100; 175; 420; 490; 515; 515; 790

The five numbers used to create a box-and-whisker plot are:

Min: 10 Q1: 15 Med: 95 Q3: 490 Max: 790

The following graph shows the box-and-whisker plot.

Figure 2.15

2.25 Follow the steps you used to graph a box-and-whisker plot for the data values shown.

0; 5; 5; 15; 30; 30; 45; 50; 50; 60; 75; 110; 140; 240; 330

2.5 | Measures of the Center of the Data

The “center” of a data set is also a way of describing location. The two most widely used measures of the “center” of the data are the mean (average) and the median. To calculate the mean weight of 50 people, add the 50 weights together and divide by 50. To find the median weight of the 50 people, order the data and find the number that splits the data into two equal parts. The median is generally a better measure of the center when there are extreme values or outliers because it is not affected by the precise numerical values of the outliers. The mean is the most common measure of the center.

NOTE

The words “mean” and “average” are often used interchangeably. The substitution of one word for the other is common practice. The technical term is “arithmetic mean” and “average” is technically a center location. However, in practice among non-statisticians, “average” is commonly accepted for “arithmetic mean.”

When each value in the data set is not unique, the mean can be calculated by multiplying each distinct value by its frequency and then dividing the sum by the total number of data values. The letter used to represent the sample mean is an x with a bar over it (pronounced “x bar”): x̄.

The Greek letter μ (pronounced “mew”) represents the population mean. One of the requirements for the sample mean to be a good estimate of the population mean is for the sample taken to be truly random.

To see that both ways of calculating the mean are the same, consider the sample: 1; 1; 1; 2; 2; 3; 4; 4; 4; 4; 4

$\bar{x} = \frac{1 + 1 + 1 + 2 + 2 + 3 + 4 + 4 + 4 + 4 + 4}{11} = 2.7$

$\bar{x} = \frac{3(1) + 2(2) + 1(3) + 5(4)}{11} = 2.7$

In the second calculation, the frequencies are 3, 2, 1, and 5.
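The two ways of calculating the mean can be compared with a few lines of Python. This is an illustrative sketch; the variable names are chosen here and are not part of the text.

```python
from collections import Counter

sample = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4]

# Way 1: add every value and divide by the number of values.
mean_direct = sum(sample) / len(sample)

# Way 2: multiply each distinct value by its frequency, add the
# products, and divide by the total number of values.
frequencies = Counter(sample)  # value -> how many times it occurs
mean_weighted = (sum(value * freq for value, freq in frequencies.items())
                 / sum(frequencies.values()))

print(dict(frequencies))
print(round(mean_direct, 1), round(mean_weighted, 1))
```

Both calculations divide the same total, 30, by the same count, 11, so they must agree.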

You can quickly find the location of the median by using the expression $\frac{n + 1}{2}$.

The letter n is the total number of data values in the sample. If n is an odd number, the median is the middle value of the ordered data (ordered smallest to largest). If n is an even number, the median is equal to the two middle values added together and divided by two after the data has been ordered. For example, if the total number of data values is 97, then $\frac{n + 1}{2} = \frac{97 + 1}{2} = 49$. The median is the 49th value in the ordered data. If the total number of data values is 100, then $\frac{n + 1}{2} = \frac{100 + 1}{2} = 50.5$. The median occurs midway between the 50th and 51st values. The location of the median and the value of the median are not the same. The upper case letter M is often used to represent the median. The next example illustrates the location of the median and the value of the median.

Example 2.26

AIDS data indicating the number of months a patient with AIDS lives after taking a new antibody drug are as follows (smallest to largest):

3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29; 29; 31; 32; 33; 33; 34; 34; 35; 37; 40; 44; 44; 47

Calculate the mean and the median.

Solution 2.26

The calculation for the mean is:

$\bar{x} = \frac{3 + 4 + (8)(2) + 10 + 11 + 12 + 13 + 14 + (15)(2) + (16)(2) + \dots + 35 + 37 + 40 + (44)(2) + 47}{40} = 23.6$

To find the median, M, first use the formula for the location. The location is: $\frac{n + 1}{2} = \frac{40 + 1}{2} = 20.5$

Starting at the smallest value, the median is located between the 20th and 21st values (the two 24s): 3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29; 29; 31; 32; 33; 33; 34; 34; 35; 37; 40; 44; 44; 47;

$M = \frac{24 + 24}{2} = 24$
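The difference between the location of the median and the value of the median can be made concrete in Python. The function names `median_location` and `median_value` are hypothetical helpers written for this sketch.

```python
def median_location(n):
    """Position of the median in the ordered data, counting from 1."""
    return (n + 1) / 2


def median_value(data):
    """Middle value, or the mean of the two middle values when n is even."""
    values = sorted(data)
    n = len(values)
    middle = n // 2
    if n % 2 == 1:
        return values[middle]
    return (values[middle - 1] + values[middle]) / 2


# Months of survival from Example 2.26.
months = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15,
          15, 16, 16, 17, 17, 18, 21, 22, 22, 24,
          24, 25, 26, 26, 27, 27, 29, 29, 31, 32,
          33, 33, 34, 34, 35, 37, 40, 44, 44, 47]

print(median_location(len(months)))          # 20.5
print(median_value(months))                  # 24.0
print(round(sum(months) / len(months), 1))   # 23.6
```

A location of 20.5 is not itself a data value; it tells you to average the 20th and 21st ordered values.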

To find the mean and the median:

Clear list L1. Press STAT 4:ClrList. Enter 2nd 1 for list L1. Press ENTER.

Enter data into the list editor. Press STAT 1:EDIT.

Put the data values into list L1.

Press STAT and arrow to CALC. Press 1:1-VarStats. Press 2nd 1 for L1 and then ENTER.

Press the down and up arrow keys to scroll.

x̄ = 23.6, M = 24

2.26 The following data show the number of months patients typically wait on a transplant list before getting surgery. The data are ordered from smallest to largest. Calculate the mean and median.

3; 4; 5; 7; 7; 7; 7; 8; 8; 9; 9; 10; 10; 10; 10; 10; 11; 12; 12; 13; 14; 14; 15; 15; 17; 17; 18; 19; 19; 19; 21; 21; 22; 22; 23; 24; 24; 24; 24

Example 2.27

Suppose that in a small town of 50 people, one person earns $5,000,000 per year and the other 49 each earn $30,000. Which is the better measure of the “center”: the mean or the median?

Solution 2.27

$\bar{x} = \frac{5{,}000{,}000 + 49(30{,}000)}{50} = 129{,}400$

M = 30,000

(There are 49 people who earn $30,000 and one person who earns $5,000,000.)

The median is a better measure of the “center” than the mean because 49 of the values are 30,000 and one is 5,000,000. The 5,000,000 is an outlier. The 30,000 gives us a better sense of the middle of the data.
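The effect of the single outlier can be checked with Python's standard `statistics` module. This is a small sketch; the list name `incomes` is chosen here for illustration.

```python
from statistics import mean, median

# One person earns $5,000,000; the other 49 each earn $30,000.
incomes = [5_000_000] + [30_000] * 49

print(mean(incomes))    # 129400
print(median(incomes))  # 30000.0

# How many of the 50 people earn less than the mean?
print(sum(1 for income in incomes if income < mean(incomes)))  # 49
```

Forty-nine of the 50 people earn less than the mean, which is why the mean is a poor description of a typical income in this town.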

2.27 In a sample of 60 households, one house is worth $2,500,000. Half of the rest are worth $280,000, and all the others are worth $315,000. Which is the better measure of the “center”: the mean or the median?

Another measure of the center is the mode. The mode is the most frequent value. There can be more than one mode in a data set as long as those values have the same frequency and that frequency is the highest. A data set with two modes is called bimodal.

Example 2.28

Statistics exam scores for 20 students are as follows:

50; 53; 59; 59; 63; 63; 72; 72; 72; 72; 72; 76; 78; 81; 83; 84; 84; 84; 90; 93

Find the mode.

Solution 2.28 The most frequent score is 72, which occurs five times. Mode = 72.

2.28 The numbers of books checked out from the library by 25 students are as follows:

0; 0; 0; 1; 2; 3; 3; 4; 4; 5; 5; 7; 7; 7; 7; 8; 8; 8; 9; 10; 10; 11; 11; 12; 12

Find the mode.

Example 2.29

Five real estate exam scores are 430, 430, 480, 480, 495. The data set is bimodal because the scores 430 and 480 each occur twice.

When is the mode the best measure of the “center”? Consider a weight loss program that advertises a mean weight loss of six pounds the first week of the program. The mode might indicate that most people lose two pounds the first week, making the program less appealing.

NOTE

The mode can be calculated for qualitative data as well as for quantitative data. For example, if the data set is: red, red, red, green, green, yellow, purple, black, blue, the mode is red.
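The modes discussed in this section can be found with `statistics.multimode` from the Python standard library (available in Python 3.8 and later), which returns every value that shares the highest frequency. The variable names below are chosen for illustration.

```python
from statistics import multimode

# Example 2.28: statistics exam scores for 20 students.
exam_scores = [50, 53, 59, 59, 63, 63, 72, 72, 72, 72,
               72, 76, 78, 81, 83, 84, 84, 84, 90, 93]

# Example 2.29: five real estate exam scores (bimodal).
real_estate_scores = [430, 430, 480, 480, 495]

# Qualitative data from the note above.
colors = ["red", "red", "red", "green", "green",
          "yellow", "purple", "black", "blue"]

print(multimode(exam_scores))         # [72]
print(multimode(real_estate_scores))  # [430, 480]
print(multimode(colors))              # ['red']
```

Returning a list makes bimodal data sets visible, which a function that returns a single mode would hide.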

Statistical software will easily calculate the mean, the median, and the mode. Some graphing calculators can also make these calculations. In the real world, people make these calculations using software.

2.29 Five credit scores are 680, 680, 700, 720, 720. The data set is bimodal because the scores 680 and 720 each occur twice. Consider the annual earnings of workers at a factory. The mode is $25,000 and occurs 150 times out of 301. The median is $50,000 and the mean is $47,500. What would be the best measure of the “center”?

The Law of Large Numbers and the Mean

The Law of Large Numbers says that if you take samples of larger and larger size from any population, then the mean x̄ of the sample is very likely to get closer and closer to µ. This is discussed in more detail later in the text.
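A small simulation can illustrate the idea. The sketch below is not from the text: it assumes a hypothetical population of fair six-sided die rolls, for which µ = 3.5, and uses a fixed random seed so that the run is repeatable.

```python
import random


def sample_mean(sample_size, rng):
    """Mean of `sample_size` simulated rolls of a fair six-sided die."""
    rolls = [rng.randint(1, 6) for _ in range(sample_size)]
    return sum(rolls) / sample_size


rng = random.Random(0)   # fixed seed so the run is repeatable
population_mean = 3.5    # mean of the equally likely values 1 through 6

for size in (10, 100, 10_000, 100_000):
    x_bar = sample_mean(size, rng)
    # Print the sample size, the sample mean, and its distance from mu.
    print(size, x_bar, abs(x_bar - population_mean))
```

The distance from µ does not have to shrink at every step, but for the largest samples it is very likely to be small.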

Sampling Distributions and Statistic of a Sampling Distribution

You can think of a sampling distribution as a relative frequency distribution with a great many samples. (See Sampling and Data for a review of relative frequency.) Suppose thirty randomly selected students were asked the number of movies they watched the previous week. The results are in the relative frequency table shown below.

# of movies | Relative Frequency
0 | 5/30
1 | 15/30
2 | 6/30
3 | 3/30
4 | 1/30

Table 2.24

If you let the number of samples get very large (say, 300 million or more), the relative frequency table becomes a relative frequency distribution.

A statistic is a number calculated from a sample. Statistic examples include the mean, the median and the mode as well as others. The sample mean x̄ is an example of a statistic which estimates the population mean μ.

Calculating the Mean of Grouped Frequency Tables

When only grouped data is available, you do not know the individual data values (you know only the intervals and the interval frequencies); therefore, you cannot compute an exact mean for the data set. What we must do is estimate the actual mean by calculating the mean of a frequency table. A frequency table is a data representation in which grouped data is displayed along with the corresponding frequencies. To calculate the mean from a grouped frequency table we can apply the basic definition of mean:

$\text{mean} = \frac{\text{data sum}}{\text{number of data values}}$

We simply need to modify the definition to fit within the restrictions of a frequency table.

Since we do not know the individual data values we can instead find the midpoint of each interval. The midpoint is

$\frac{\text{lower boundary} + \text{upper boundary}}{2}$.

We can now modify the mean definition to be

$\text{Mean of Frequency Table} = \frac{\sum f m}{\sum f}$

where f = the frequency of the interval and m = the midpoint of the interval.

Example 2.30

A frequency table displaying Professor Blount’s last statistics test is shown. Find the best estimate of the class mean.

Grade Interval Number of Students

50–56.5 1

56.5–62.5 0

62.5–68.5 4

68.5–74.5 4

74.5–80.5 2

80.5–86.5 3

86.5–92.5 4

92.5–98.5 1

Table 2.25

Solution 2.30

• Find the midpoints for all intervals.

Grade Interval Midpoint

50–56.5 53.25

56.5–62.5 59.5

62.5–68.5 65.5

68.5–74.5 71.5

74.5–80.5 77.5

80.5–86.5 83.5

86.5–92.5 89.5

92.5–98.5 95.5

Table 2.26

• Calculate the sum of the product of each interval frequency and midpoint, $\sum f m$:

53.25(1) + 59.5(0) + 65.5(4) + 71.5(4) + 77.5(2) + 83.5(3) + 89.5(4) + 95.5(1) = 1460.25

• $\mu = \frac{\sum f m}{\sum f} = \frac{1460.25}{19} = 76.86$
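The steps of this solution can be sketched in Python. The function name `grouped_mean` and its argument layout (a list of interval boundaries and a matching list of frequencies) are assumptions made for this illustration.

```python
def grouped_mean(intervals, frequencies):
    """Estimate the mean of grouped data as sum(f * m) / sum(f),
    where m is the midpoint of each (lower, upper) interval."""
    midpoints = [(lower + upper) / 2 for lower, upper in intervals]
    weighted_sum = sum(f * m for f, m in zip(frequencies, midpoints))
    return weighted_sum / sum(frequencies)


# Grade intervals and student counts from Table 2.25.
grade_intervals = [(50, 56.5), (56.5, 62.5), (62.5, 68.5), (68.5, 74.5),
                   (74.5, 80.5), (80.5, 86.5), (86.5, 92.5), (92.5, 98.5)]
students = [1, 0, 4, 4, 2, 3, 4, 1]

print(round(grouped_mean(grade_intervals, students), 2))  # 76.86
```

The result is an estimate because every value in an interval is treated as if it were equal to that interval's midpoint.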

2.30 Maris conducted a study on the effect that playing video games has on memory recall. As part of her study, she compiled the following data:

Hours Teenagers Spend on Video Games Number of Teenagers

0–3.5 3

3.5–7.5 7

7.5–11.5 12

11.5–15.5 7

15.5–19.5 9

Table 2.27

What is the best estimate for the mean number of hours spent playing video games?

2.6 | Skewness and the Mean, Median, and Mode

Consider the following data set.

4; 5; 6; 6; 6; 7; 7; 7; 7; 7; 7; 8; 8; 8; 9; 10

This data set can be represented by the following histogram. Each interval has width one, and each value is located in the middle of an interval.

Figure 2.16

The histogram displays a symmetrical distribution of data. A distribution is symmetrical if a vertical line can be drawn at some point in the histogram such that the shape to the left and the right of the vertical line are mirror images of each other. The mean, the median, and the mode are each seven for these data. In a perfectly symmetrical distribution, the mean and the median are the same. This example has one mode (unimodal), and the mode is the same as the mean and median. In a symmetrical distribution that has two modes (bimodal), the two modes would be different from the mean and median.

The histogram for the data: 4; 5; 6; 6; 6; 7; 7; 7; 7; 8 is not symmetrical. The right-hand side seems “chopped off” compared to the left side. A distribution of this type is called skewed to the left because it is pulled out to the left.

Figure 2.17

The mean is 6.3, the median is 6.5, and the mode is seven. Notice that the mean is less than the median, and they are both less than the mode. The mean and the median both reflect the skewing, but the mean reflects it more so.

The histogram for the data: 6; 7; 7; 7; 7; 8; 8; 8; 9; 10, is also not symmetrical. It is skewed to the right.

Figure 2.18

The mean is 7.7, the median is 7.5, and the mode is seven. Of the three statistics, the mean is the largest, while the mode is the smallest. Again, the mean reflects the skewing the most.

To summarize, generally if the distribution of data is skewed to the left, the mean is less than the median, which is often less than the mode. If the distribution of data is skewed to the right, the mode is often less than the median, which is less than the mean.
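The three data sets from this section can be compared side by side in Python using the standard `statistics` module. The dictionary labels are names chosen for this sketch.

```python
from statistics import mean, median, mode

data_sets = {
    "symmetrical": [4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10],
    "skewed left": [4, 5, 6, 6, 6, 7, 7, 7, 7, 8],
    "skewed right": [6, 7, 7, 7, 7, 8, 8, 8, 9, 10],
}

# For each data set, collect (mean, median, mode).
summary = {name: (mean(values), median(values), mode(values))
           for name, values in data_sets.items()}

for name, (x_bar, m, most_frequent) in summary.items():
    print(name, "mean:", x_bar, "median:", m, "mode:", most_frequent)
```

In the left-skewed set the mean falls below the median, and in the right-skewed set it rises above it, matching the summary above. Each of these three data sets has a single mode, so `mode` is sufficient here.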

Skewness and symmetry become important when we discuss probability distributions in later chapters.

Example 2.31

Statistics are used to compare and sometimes identify authors. The following lists show a simple random sample that compares the letter counts for three authors.

Terry: 7; 9; 3; 3; 3; 4; 1; 3; 2; 2

Davis: 3; 3; 3; 4; 1; 4; 3; 2; 3; 1

Maris: 2; 3; 4; 4; 4; 6; 6; 6; 8; 3

a. Make a dot plot for the three authors and compare the shapes.

b. Calculate the mean for each.

c. Calculate the median for each.

d. Describe any pattern you notice between the shape and the measures of center.

Solution 2.31

a. Figure 2.19 Terry’s distribution has a right (positive) skew.

Figure 2.20 Davis’ distribution has a left (negative) skew.

Figure 2.21 Maris’ distribution is symmetrically shaped.

b. Terry’s mean is 3.7, Davis’ mean is 2.7, Maris’ mean is 4.6.

c. Terry’s median is three, Davis’ median is three. Maris’ median is four.

d. It appears that the median is always closest to the high point (the mode), while the mean tends to be farther out on the tail. In a symmetrical distribution, the mean and the median are both centrally located close to the high point of the distribution.

2.31 Discuss the mean, median, and mode for each of the following problems. Is there a pattern between the shape and measure of the center?

a.

Figure 2.22

b.

The Ages at Which Former U.S. Presidents Died

4 | 6 9
5 | 3 6 7 7 7 8
6 | 0 0 3 3 4 4 5 6 7 7 7 8
7 | 0 1 1 2 3 4 7 8 8 9
8 | 0 1 3 5 8
9 | 0 0 3 3

Key: 8|0 means 80.

Table 2.28

c.

Figure 2.23

2.7 | Measures of the Spread of the Data

An important characteristic of any set of data is the variation in the data. In some data sets, the data values are concentrated closely near the mean; in other data sets, the data values are more widely spread out from the mean. The most common measure of variation, or spread, is the standard deviation. The standard deviation is a number that measures how far data values are from their mean.

The standard deviation

• provides a numerical measure of the overall amount of variation in a data set, and

• can be used to determine whether a particular data value is close to or far from the mean.

The standard deviation provides a measure of the overall variation in a data set

The standard deviation is always positive or zero. The standard deviation is small when the data are all concentrated close to the mean, exhibiting little variation or spread. The standard deviation is larger when the data values are more spread out from the mean, exhibiting more variation.

Suppose that we are studying the amount of time customers wait in line at the checkout at supermarket A and supermarket B. The average wait time at both supermarkets is five minutes. At supermarket A, the standard deviation for the wait time is two minutes; at supermarket B the standard deviation for the wait time is four minutes.

Because supermarket B has a higher standard deviation, we know that there is more variation in the wait times at supermarket B. Overall, wait times at supermarket B are more spread out from the average; wait times at supermarket A are more concentrated near the average.

The standard deviation can be used to determine whether a data value is close to or far from the mean

Suppose that Rosa and Binh both shop at supermarket A. Rosa waits at the checkout counter for seven minutes and Binh waits for one minute. At supermarket A, the mean waiting time is five minutes and the standard deviation is two minutes.

Rosa waits for seven minutes:

• Seven is two minutes longer than the average of five; two minutes is equal to one standard deviation.

• Rosa’s wait time of seven minutes is two minutes longer than the average of five minutes.

• Rosa’s wait time of seven minutes is one standard deviation above the average of five minutes.

Binh waits for one minute.

• One is four minutes less than the average of five; four minutes is equal to two standard deviations.

• Binh’s wait time of one minute is four minutes less than the average of five minutes.

• Binh’s wait time of one minute is two standard deviations below the average of five minutes.

• A data value that is two standard deviations from the average is just on the borderline for what many statisticians would consider to be far from the average. Considering data to be far from the mean if it is more than two standard deviations away is more of an approximate “rule of thumb” than a rigid rule. In general, the shape of the distribution of the data affects how much of the data is further away than two standard deviations. (You will learn more about this in later chapters.)

The number line may help you understand standard deviation. If we were to put five and seven on a number line, seven is to the right of five. We say, then, that seven is one standard deviation to the right of five because 5 + (1)(2) = 7.

If one were also part of the data set, then one is two standard deviations to the left of five because 5 + (–2)(2) = 1.

Figure 2.24

• In general, a value = mean + (#ofSTDEVs)(standard deviation)

• where #ofSTDEVs = the number of standard deviations

• #ofSTDEVs does not need to be an integer

• One is two standard deviations less than the mean of five because: 1 = 5 + (–2)(2).

The equation value = mean + (#ofSTDEVs)(standard deviation) can be expressed for a sample and for a population.

• Sample: x = x̄ + (#ofSTDEVs)(s)

• Population: x = µ + (#ofSTDEVs)(σ)

The lower case letter s represents the sample standard deviation and the Greek letter σ (sigma, lower case) represents the population standard deviation.

The symbol x̄ is the sample mean and the Greek symbol µ is the population mean.
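The equation value = mean + (#ofSTDEVs)(standard deviation) can be solved for #ofSTDEVs, which gives a quick way to check Rosa's and Binh's wait times. The function name `num_stdevs` is a hypothetical helper for this sketch.

```python
def num_stdevs(value, mean, stdev):
    """Solve value = mean + (#ofSTDEVs)(standard deviation) for #ofSTDEVs."""
    return (value - mean) / stdev


# Supermarket A: mean wait of five minutes, standard deviation of two minutes.
print(num_stdevs(7, mean=5, stdev=2))  # Rosa:  1.0
print(num_stdevs(1, mean=5, stdev=2))  # Binh: -2.0
```

A positive result means the value is above the mean and a negative result means it is below the mean.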

Calculating the Standard Deviation

If x is a number, then the difference “x – mean” is called its deviation. In a data set, there are as many deviations as there are items in the data set. The deviations are used to calculate the standard deviation. If the numbers belong to a population, in symbols a deviation is x – μ. For sample data, in symbols a deviation is x – x̄.

The procedure to calculate the standard deviation depends on whether the numbers are the entire population or are data from a sample. The calculations are similar, but not identical. Therefore the symbol used to represent the standard deviation depends on whether it is calculated from a population or a sample. The lower case letter s represents the sample standard deviation and the Greek letter σ (sigma, lower case) represents the population standard deviation. If the sample has the same characteristics as the population, then s should be a good estimate of σ.

To calculate the standard deviation, we need to calculate the variance first. The variance is the average of the squares of the deviations (the x – x̄ values for a sample, or the x – μ values for a population). The symbol σ² represents the population variance; the population standard deviation σ is the square root of the population variance. The symbol s² represents the sample variance; the sample standard deviation s is the square root of the sample variance. You can think of the standard deviation as a special average of the deviations.

If the numbers come from a census of the entire population and not a sample, when we calculate the average of the squared deviations to find the variance, we divide by N, the number of items in the population. If the data are from a sample rather than a population, when we calculate the average of the squared deviations, we divide by n – 1, one less than the number of items in the sample.

Formulas for the Sample Standard Deviation

• $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ or $s = \sqrt{\frac{\sum f (x - \bar{x})^2}{n - 1}}$

• For the sample standard deviation, the denominator is n – 1, that is the sample size MINUS 1.

Formulas for the Population Standard Deviation

• $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$ or $\sigma = \sqrt{\frac{\sum f (x - \mu)^2}{N}}$

• For the population standard deviation, the denominator is N, the number of items in the population.

In these formulas, f represents the frequency with which a value appears. For example, if a value appears once, f is one. If a value appears three times in the data set or population, f is three.
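The frequency forms of the two formulas can be written out directly in Python. This is a sketch for illustration: the function names are hypothetical, and the small data set at the bottom is made up so that the arithmetic is easy to follow by hand.

```python
from math import sqrt


def sample_stdev(values, frequencies):
    """Sample standard deviation s: sum of f(x - x_bar)^2 divided by n - 1."""
    n = sum(frequencies)
    x_bar = sum(f * x for x, f in zip(values, frequencies)) / n
    sum_sq = sum(f * (x - x_bar) ** 2 for x, f in zip(values, frequencies))
    return sqrt(sum_sq / (n - 1))


def population_stdev(values, frequencies):
    """Population standard deviation sigma: sum of f(x - mu)^2 divided by N."""
    big_n = sum(frequencies)
    mu = sum(f * x for x, f in zip(values, frequencies)) / big_n
    sum_sq = sum(f * (x - mu) ** 2 for x, f in zip(values, frequencies))
    return sqrt(sum_sq / big_n)


# Made-up data: the value 2 appears once, 4 appears twice, 6 appears once.
values = [2, 4, 6]
frequencies = [1, 2, 1]

# The mean is 4 and the squared deviations add up to 8.
print(sample_stdev(values, frequencies))      # square root of 8/3
print(population_stdev(values, frequencies))  # square root of 8/4
```

The only difference between the two functions is the denominator, so for the same numbers the sample formula always gives the larger result.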

Sampling Variability of a Statistic

The statistic of a sampling distribution was discussed in Descriptive Statistics: Measuring the Center of the Data. How much the statistic varies from one sample to another is known as the sampling variability of a statistic. You typically measure the sampling variability of a statistic by its standard error. The standard error of the mean is an example of a standard error. It is a special standard deviation and is known as the standard deviation of the sampling distribution of the mean. You will cover the standard error of the mean in the chapter The Central Limit Theorem (not now). The notation for the standard error of the mean is $\frac{\sigma}{\sqrt{n}}$ where σ is the standard deviation of the population and n is the size of the sample.

NOTE

In practice, USE A CALCULATOR OR COMPUTER SOFTWARE TO CALCULATE THE STANDARD DEVIATION. If you are using a TI-83, 83+, 84+ calculator, you need to select the appropriate standard deviation σx or sx from the summary statistics. We will concentrate on using and interpreting the information that the standard deviation gives us. However you should study the following step-by-step example to help you understand how the standard deviation measures variation from the mean. (The calculator instructions appear at the end of this example.)

Example 2.32

In a fifth grade class, the teacher was interested in the average age and the sample standard deviation of the ages of her students. The following data are the ages for a SAMPLE of n = 20 fifth grade students. The ages are rounded to the nearest half year:

9; 9.5; 9.5; 10; 10; 10; 10; 10.5; 10.5; 10.5; 10.5; 11; 11; 11; 11; 11; 11; 11.5; 11.5; 11.5;

$\bar{x} = \frac{9 + 9.5(2) + 10(4) + 10.5(4) + 11(6) + 11.5(3)}{20} = 10.525$

The average age is 10.53 years, rounded to two places.

The variance may be calculated by using a table. Then the standard deviation is calculated by taking the square root of the variance. We will explain the parts of the table after calculating s.

Data | Freq. | Deviations | Deviations² | (Freq.)(Deviations²)
x | f | (x – x̄) | (x – x̄)² | (f)(x – x̄)²
9 | 1 | 9 – 10.525 = –1.525 | (–1.525)² = 2.325625 | 1 × 2.325625 = 2.325625
9.5 | 2 | 9.5 – 10.525 = –1.025 | (–1.025)² = 1.050625 | 2 × 1.050625 = 2.101250
10 | 4 | 10 – 10.525 = –0.525 | (–0.525)² = 0.275625 | 4 × 0.275625 = 1.1025
10.5 | 4 | 10.5 – 10.525 = –0.025 | (–0.025)² = 0.000625 | 4 × 0.000625 = 0.0025
11 | 6 | 11 – 10.525 = 0.475 | (0.475)² = 0.225625 | 6 × 0.225625 = 1.35375
11.5 | 3 | 11.5 – 10.525 = 0.975 | (0.975)² = 0.950625 | 3 × 0.950625 = 2.851875

The total is 9.7375

Table 2.29

The sample variance, s², is equal to the sum of the last column (9.7375) divided by the total number of data values minus one (20 – 1):

$s^2 = \frac{9.7375}{20 - 1} = 0.5125$

The sample standard deviation s is equal to the square root of the sample variance:

$s = \sqrt{0.5125} = 0.715891$, which is rounded to two decimal places, s = 0.72.

Typically, you do the calculation for the standard deviation on your calculator or computer. The intermediate results are not rounded. This is done for accuracy.

• For the following problems, recall that value = mean + (#ofSTDEVs)(standard deviation). Verify the mean and standard deviation on a calculator or computer.

• For a sample: x = x̄ + (#ofSTDEVs)(s)

• For a population: x = μ + (#ofSTDEVs)(σ)

• For this example, use x = x̄ + (#ofSTDEVs)(s) because the data is from a sample.

a. Verify the mean and standard deviation on your calculator or computer.

b. Find the value that is one standard deviation above the mean. Find ( x̄ + 1s).

c. Find the value that is two standard deviations below the mean. Find ( x̄ – 2s).

d. Find the values that are 1.5 standard deviations from (below and above) the mean.

Solution 2.32 a. ◦ Clear lists L1 and L2. Press STAT 4:ClrList. Enter 2nd 1 for L1, the comma (,), and 2nd 2 for L2.

◦ Enter data into the list editor. Press STAT 1:EDIT. If necessary, clear the lists by arrowing up into the name. Press CLEAR and arrow down.

◦ Put the data values (9, 9.5, 10, 10.5, 11, 11.5) into list L1 and the frequencies (1, 2, 4, 4, 6, 3) into list L2. Use the arrow keys to move around.

◦ Press STAT and arrow to CALC. Press 1:1-VarStats and enter L1 (2nd 1), L2 (2nd 2). Do not forget the comma. Press ENTER.

◦ x̄ = 10.525

◦ Use Sx because this is sample data (not a population): Sx=0.715891

b. ( x̄ + 1s) = 10.53 + (1)(0.72) = 11.25

c. ( x̄ – 2s) = 10.53 – (2)(0.72) = 9.09

d. ◦ ( x̄ – 1.5s) = 10.53 – (1.5)(0.72) = 9.45

◦ ( x̄ + 1.5s) = 10.53 + (1.5)(0.72) = 11.61
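The table calculation and parts b through d can also be checked in software. The following is a minimal Python sketch, not part of the original example; the variable and function names are illustrative. It works from the frequency table and keeps full precision throughout, so its results can differ slightly in the last digit from the answers above, which use the rounded values 10.53 and 0.72.

```python
import math

# Ages and frequencies from Table 2.29 (Example 2.32).
ages = [9, 9.5, 10, 10.5, 11, 11.5]
freqs = [1, 2, 4, 4, 6, 3]

n = sum(freqs)  # sample size
mean = sum(x * f for x, f in zip(ages, freqs)) / n

# Sum of (frequency) x (squared deviation): the last column of the table.
sum_sq_dev = sum(f * (x - mean) ** 2 for x, f in zip(ages, freqs))

variance = sum_sq_dev / (n - 1)  # divide by n - 1 because this is a sample
s = math.sqrt(variance)


def value_at(num_stdevs):
    """Return mean + (#ofSTDEVs)(s)."""
    return mean + num_stdevs * s


print(f"mean = {mean:.3f}, s^2 = {variance:.4f}, s = {s:.6f}")
for k in (1, -2, -1.5, 1.5):
    print(f"{k:+} standard deviations from the mean: {value_at(k):.2f}")
```

Dividing by `n` instead of `n - 1` in the variance line would give the population standard deviation (σx on the calculator) rather than the sample value (Sx).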

2.32 On a baseball team, the ages of each of the players are as follows: 21; 21; 22; 23; 24; 24; 25; 25; 28; 29; 29; 31; 32; 33; 33; 34; 35; 36; 36; 36; 36; 38; 38; 38; 40

Use your calculator or computer to find the mean and standard deviation. Then find the value that is two standard deviations above the mean.

Explanation of the standard deviation calculation shown in the table

The deviations show how spread out the data are about the mean. The data value 11.5 is farther from the mean than is the data value 11, which is indicated by the deviations 0.975 and 0.475. A positive deviation occurs when the data value is greater than the mean, whereas a negative deviation occurs when the data value is less than the mean. The deviation is –1.525 for the data value nine. If you add the deviations, the sum is always zero. (For Example 2.32, there are n = 20 deviations.) So you cannot simply add the deviations to get the spread of the data. By squaring the deviations, you make them positive numbers, and the sum will also be positive. The variance, then, is the average squared deviation.

The variance is a squared measure and does not have the same units as the data. Taking the square root solves the problem. The standard deviation measures the spread in the same units as the data.

Notice that instead of dividing by n = 20, the calculation divided by n – 1 = 20 – 1 = 19 because the data is a sample.


For the sample variance, we divide by the sample size minus one (n – 1). Why not divide by n? The answer has to do with the population variance. The sample variance is an estimate of the population variance. Based on the theoretical mathematics that lies behind these calculations, dividing by (n – 1) gives a better estimate of the population variance.

NOTE

Your concentration should be on what the standard deviation tells us about the data. The standard deviation is a number which measures how far the data are spread from the mean. Let a calculator or computer do the arithmetic.

The standard deviation, s or σ, is either zero or larger than zero. Describing the data with reference to the spread is called “variability”. The variability in data depends upon the method by which the outcomes are obtained; for example, by measuring or by random sampling. When the standard deviation is zero, there is no spread; that is, all the data values are equal to each other. The standard deviation is small when the data are all concentrated close to the mean, and is larger when the data values show more variation from the mean. When the standard deviation is a lot larger than zero, the data values are very spread out about the mean; outliers can make s or σ very large.

The standard deviation, when first presented, can seem unclear. By graphing your data, you can get a better “feel” for the deviations and the standard deviation. You will find that in symmetrical distributions, the standard deviation can be very helpful but in skewed distributions, the standard deviation may not be much help. The reason is that the two sides of a skewed distribution have different spreads. In a skewed distribution, it is better to look at the first quartile, the median, the third quartile, the smallest value, and the largest value. Because numbers can be confusing, always graph your data. Display your data in a histogram or a box plot.

Example 2.33

Use the following data (first exam scores) from Susan Dean’s spring pre-calculus class:

33; 42; 49; 49; 53; 55; 55; 61; 63; 67; 68; 68; 69; 69; 72; 73; 74; 78; 80; 83; 88; 88; 88; 90; 92; 94; 94; 94; 94; 96; 100

a. Create a chart containing the data, frequencies, relative frequencies, and cumulative relative frequencies to three decimal places.

b. Calculate the following to one decimal place using a TI-83+ or TI-84 calculator:

i. The sample mean

ii. The sample standard deviation

iii. The median

iv. The first quartile

v. The third quartile

vi. IQR

c. Construct a box plot and a histogram on the same set of axes. Make comments about the box plot, the histogram, and the chart.

Solution 2.33 a. See Table 2.30

b. i. The sample mean = 73.5

ii. The sample standard deviation = 17.9

iii. The median = 73

iv. The first quartile = 61

v. The third quartile = 90

vi. IQR = 90 – 61 = 29


c. The x-axis goes from 32.5 to 100.5; the y-axis goes from –2.4 to 15 for the histogram. The number of intervals is five, so the width of an interval is (100.5 – 32.5) divided by five, which is equal to 13.6. Endpoints of the intervals are as follows: the starting point is 32.5, 32.5 + 13.6 = 46.1, 46.1 + 13.6 = 59.7, 59.7 + 13.6 = 73.3, 73.3 + 13.6 = 86.9, 86.9 + 13.6 = 100.5 = the ending value. No data values fall on an interval boundary.

Figure 2.25

The long left whisker in the box plot is reflected in the left side of the histogram. The spread of the exam scores in the lower 50% is greater (73 – 33 = 40) than the spread in the upper 50% (100 – 73 = 27). The histogram, box plot, and chart all reflect this. There are a substantial number of A and B grades (80s, 90s, and 100). The histogram clearly shows this. The box plot shows us that the middle 50% of the exam scores (IQR = 29) are Ds, Cs, and Bs. The box plot also shows us that the lower 25% of the exam scores are Ds and Fs.

| Data | Frequency | Relative Frequency | Cumulative Relative Frequency |
|---|---|---|---|
| 33 | 1 | 0.032 | 0.032 |
| 42 | 1 | 0.032 | 0.064 |
| 49 | 2 | 0.065 | 0.129 |
| 53 | 1 | 0.032 | 0.161 |
| 55 | 2 | 0.065 | 0.226 |
| 61 | 1 | 0.032 | 0.258 |
| 63 | 1 | 0.032 | 0.290 |
| 67 | 1 | 0.032 | 0.322 |
| 68 | 2 | 0.065 | 0.387 |
| 69 | 2 | 0.065 | 0.452 |
| 72 | 1 | 0.032 | 0.484 |
| 73 | 1 | 0.032 | 0.516 |
| 74 | 1 | 0.032 | 0.548 |
| 78 | 1 | 0.032 | 0.580 |
| 80 | 1 | 0.032 | 0.612 |
| 83 | 1 | 0.032 | 0.644 |
| 88 | 3 | 0.097 | 0.741 |
| 90 | 1 | 0.032 | 0.773 |
| 92 | 1 | 0.032 | 0.805 |
| 94 | 4 | 0.129 | 0.934 |
| 96 | 1 | 0.032 | 0.966 |
| 100 | 1 | 0.032 | 0.998 (Why isn’t this value 1?) |

Table 2.30
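If you are working on a computer rather than a TI-83+ or TI-84, the summary statistics in part b can be reproduced with a short script. The sketch below is one possible approach in Python, using only the standard library; it is not part of the original solution. It assumes the quartiles are located at positions (k/100)(n + 1), which is what the `method="exclusive"` option does; other software may use a different quartile convention and give slightly different values for other data sets.

```python
import statistics
from collections import Counter

# First exam scores from Example 2.33.
scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

n = len(scores)
mean = statistics.mean(scores)
s = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)

# Quartiles at positions (k/100)(n + 1): the 8th, 16th, and 24th ordered values.
q1, median, q3 = statistics.quantiles(scores, n=4, method="exclusive")
iqr = q3 - q1

print(f"mean = {mean:.1f}, s = {s:.1f}")
print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")

# Frequency table; the cumulative column is accumulated before rounding.
cumulative = 0.0
for value, freq in sorted(Counter(scores).items()):
    rel = freq / n
    cumulative += rel
    print(f"{value:>4} {freq:>2} {rel:.3f} {cumulative:.3f}")
```

The cumulative column here is built from unrounded relative frequencies, whereas Table 2.30 adds relative frequencies that were already rounded to three places. Comparing the two final rows is a useful hint for the question posed in the table.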

2.33 The following data show the number of different types of pet food that stores in the area carry. 6; 6; 6; 6; 7; 7; 7; 7; 7; 8; 9; 9; 9; 9; 10; 10; 10; 10; 10; 11; 11; 11; 11; 12; 12; 12; 12; 12; 12. Calculate the sample mean and the sample standard deviation to one decimal place using a TI-83+ or TI-84 calculator.

Standard deviation of Grouped Frequency Tables

Recall that for grouped data we do not know individual data values, so we cannot describe the typical value of the data with precision. In other words, we cannot find the exact mean, median, or mode. We can, however, determine the best estimate of the measures of center by finding the mean of the grouped data with the formula:

$\text{Mean of Frequency Table} = \frac{\sum fm}{\sum f}$

where f = interval frequencies and m = interval midpoints.

Just as we could not find the exact mean, neither can we find the exact standard deviation. Remember that standard deviation describes numerically the expected deviation a data value has from the mean. In simple English, the standard deviation allows us to compare how “unusual” individual data is compared to the mean.

Example 2.34

Find the standard deviation for the data in Table 2.31.

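| Class | Frequency, f | Midpoint, m | m² | x̄ | fm² | Standard Deviation |
|---|---|---|---|---|---|---|
| 0–2 | 1 | 1 | 1 | 7.58 | 1 | 3.5 |
| 3–5 | 6 | 4 | 16 | 7.58 | 96 | 3.5 |
| 6–8 | 10 | 7 | 49 | 7.58 | 490 | 3.5 |
| 9–11 | 7 | 10 | 100 | 7.58 | 700 | 3.5 |
| 12–14 | 0 | 13 | 169 | 7.58 | 0 | 3.5 |
| 15–17 | 2 | 16 | 256 | 7.58 | 512 | 3.5 |

Table 2.31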


For this data set, we have the mean, x̄ = 7.58 and the standard deviation, sx = 3.5. This means that a randomly selected data value would be expected to be 3.5 units from the mean. If we look at the first class, we see that the class midpoint is equal to one. This is almost two full standard deviations from the mean since 7.58 – 3.5 – 3.5 = 0.58. While the formula for calculating the standard deviation is not complicated, $s_x = \sqrt{\frac{\sum f(m - \bar{x})^2}{n - 1}}$ where $s_x$ = sample standard deviation, x̄ = sample mean, the calculations are tedious. It is usually best to use technology when performing the calculations.
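As one illustration of "using technology," the grouped-data calculation can be written in a few lines of Python. This is a sketch only, with illustrative variable names; it applies the sample formula with n – 1 to the class midpoints and frequencies of Table 2.31.

```python
import math

# Class midpoints and frequencies from Table 2.31.
midpoints = [1, 4, 7, 10, 13, 16]
freqs = [1, 6, 10, 7, 0, 2]

n = sum(freqs)  # total number of data values

# Mean of a frequency table: sum(f * m) / sum(f).
mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Sample standard deviation: sqrt( sum(f * (m - mean)^2) / (n - 1) ).
sum_sq_dev = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints))
s_x = math.sqrt(sum_sq_dev / (n - 1))

print(f"n = {n}, mean = {mean:.2f}, s_x = {s_x:.1f}")
```

Because every value in a class is replaced by the class midpoint, both results are estimates rather than exact values for the underlying data.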

2.34 Find the standard deviation for the data from the previous example.

| Class | Frequency, f |
|---|---|
| 0–2 | 1 |
| 3–5 | 6 |
| 6–8 | 10 |
| 9–11 | 7 |
| 12–14 | 0 |
| 15–17 | 2 |

Table 2.32

First, press the STAT key and select 1:Edit

Figure 2.26

Input the midpoint values into L1 and the frequencies into L2


Figure 2.27

Select STAT, CALC, and 1: 1-Var Stats

Figure 2.28

Select 2nd then 1, then the comma (,), then 2nd then 2, and press ENTER.

Figure 2.29

You will see displayed both a population standard deviation, σx, and the sample standard deviation, sx.

Comparing Values from Different Data Sets

The standard deviation is useful when comparing data values that come from different data sets. If the data sets have different means and standard deviations, then comparing the data values directly can be misleading.

• For each data value, calculate how many standard deviations away from its mean the value is.

• Use the formula: value = mean + (#ofSTDEVs)(standard deviation); solve for #ofSTDEVs.

• $\#\text{ofSTDEVs} = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$

• Compare the results of this calculation.

#ofSTDEVs is often called a “z-score”; we can use the symbol z. In symbols, the formulas become:

| Sample | $x = \bar{x} + zs$ | $z = \frac{x - \bar{x}}{s}$ |
| Population | $x = \mu + z\sigma$ | $z = \frac{x - \mu}{\sigma}$ |

Table 2.33

Example 2.35

Two students, John and Ali, from different high schools, wanted to find out who had the highest GPA when compared to his school. Which student had the highest GPA when compared to his school?

| Student | GPA | School Mean GPA | School Standard Deviation |
|---|---|---|---|
| John | 2.85 | 3.0 | 0.7 |
| Ali | 77 | 80 | 10 |

Table 2.34

Solution 2.35

For each student, determine how many standard deviations (#ofSTDEVs) his GPA is away from the average, for his school. Pay careful attention to signs when comparing and interpreting the answer.

$z = \#\text{ofSTDEVs} = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma}$

For John, $z = \#\text{ofSTDEVs} = \frac{2.85 - 3.0}{0.7} = -0.21$

For Ali, $z = \#\text{ofSTDEVs} = \frac{77 - 80}{10} = -0.3$

John has the better GPA when compared to his school because his GPA is 0.21 standard deviations below his school’s mean while Ali’s GPA is 0.3 standard deviations below his school’s mean.

John’s z-score of –0.21 is higher than Ali’s z-score of –0.3. For GPA, higher values are better, so we conclude that John has the better GPA when compared to his school.
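The comparison in Solution 2.35 can be expressed as a small Python sketch. The function name `z_score` is illustrative and not from the text; the inputs are the values in Table 2.34.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from mean."""
    return (value - mean) / std_dev


# Values from Table 2.34 (the two schools use different GPA scales).
z_john = z_score(2.85, 3.0, 0.7)
z_ali = z_score(77, 80, 10)

print(f"John: z = {z_john:.2f}")
print(f"Ali:  z = {z_ali:.2f}")

# For GPA, higher is better, so the larger z-score is the better standing.
better = "John" if z_john > z_ali else "Ali"
print(f"{better} has the better GPA when compared to his school.")
```

When lower values are better, as with race times, the comparison in the last step is reversed: the smaller z-score indicates the better relative result.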

2.35 Two swimmers, Angie and Beth, from different teams, wanted to find out who had the fastest time for the 50 meter freestyle when compared to her team. Which swimmer had the fastest time when compared to her team?


| Swimmer | Time (seconds) | Team Mean Time | Team Standard Deviation |
|---|---|---|---|
| Angie | 26.2 | 27.2 | 0.8 |
| Beth | 27.3 | 30.1 | 1.4 |

Table 2.35

The following lists give a few facts that provide a little more insight into what the standard deviation tells us about the distribution of the data.

For ANY data set, no matter what the distribution of the data is:

• At least 75% of the data is within two standard deviations of the mean.

• At least 89% of the data is within three standard deviations of the mean.

• At least 95% of the data is within 4.5 standard deviations of the mean.

• This is known as Chebyshev’s Rule.

For data having a distribution that is BELL-SHAPED and SYMMETRIC:

• Approximately 68% of the data is within one standard deviation of the mean.

• Approximately 95% of the data is within two standard deviations of the mean.

• More than 99% of the data is within three standard deviations of the mean.

• This is known as the Empirical Rule.

• It is important to note that this rule only applies when the shape of the distribution of the data is bell-shaped and symmetric. We will learn more about this when studying the “Normal” or “Gaussian” probability distribution in later chapters.
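The three percentages listed under Chebyshev’s Rule come from a single expression that the text does not state: for k greater than one, at least 1 – 1/k² of any data set lies within k standard deviations of the mean. The short Python sketch below, with an illustrative function name, evaluates that expression for the three values of k used above.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of any data set within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


for k in (2, 3, 4.5):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1%} of the data")
```

These are guaranteed minimums for any distribution. The Empirical Rule percentages are larger for the same number of standard deviations, but they are approximations that hold only for bell-shaped, symmetric data.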

2.8 | Descriptive Statistics


2.1 Descriptive Statistics Class Time:

Names:

Student Learning Outcomes

• The student will construct a histogram and a box plot.

• The student will calculate univariate statistics.

• The student will examine the graphs to interpret what the data implies.

Collect the Data

Record the number of pairs of shoes you own.

1. Randomly survey 30 classmates about the number of pairs of shoes they own. Record their values.

_____ _____ _____ _____ _____

_____ _____ _____ _____ _____

_____ _____ _____ _____ _____

_____ _____ _____ _____ _____

_____ _____ _____ _____ _____

_____ _____ _____ _____ _____

Table 2.36 Survey Results

2. Construct a histogram. Make five to six intervals. Sketch the graph using a ruler and pencil and scale the axes.

Figure 2.30

3. Calculate the following values.

a. x̄ = _____

b. s = _____

4. Are the data discrete or continuous? How do you know?


5. In complete sentences, describe the shape of the histogram.

6. Are there any potential outliers? List the value(s) that could be outliers. Use a formula to check the end values to determine if they are potential outliers.

Analyze the Data

1. Determine the following values.

a. Min = _____

b. M = _____

c. Max = _____

d. Q1 = _____

e. Q3 = _____

f. IQR = _____

2. Construct a box plot of the data.

3. What does the shape of the box plot imply about the concentration of data? Use complete sentences.

4. Using the box plot, how can you determine if there are potential outliers?

5. How does the standard deviation help you to determine concentration of the data and whether or not there are potential outliers?

6. What does the IQR represent in this problem?

7. Show your work to find the value that is 1.5 standard deviations:

a. above the mean.

b. below the mean.


KEY TERMS

Box plot: a graph that gives a quick picture of the middle 50% of the data

First Quartile: the value that is the median of the lower half of the ordered data set

Frequency: the number of times a value of the data occurs

Frequency Polygon: looks like a line graph but uses intervals to display ranges of large amounts of data

Frequency Table: a data representation in which grouped data is displayed along with the corresponding frequencies

Histogram: a graphical representation in x-y form of the distribution of data in a data set; x represents the data and y represents the frequency, or relative frequency. The graph consists of contiguous rectangles.

Interquartile Range: or IQR, is the range of the middle 50 percent of the data values; the IQR is found by subtracting the first quartile from the third quartile.

Interval: also called a class interval; an interval represents a range of data and is used when displaying large data sets

Mean: a number that measures the central tendency of the data; a common name for mean is ‘average.’ The term ‘mean’ is a shortened form of ‘arithmetic mean.’ By definition, the mean for a sample (denoted by x̄) is $\bar{x} = \frac{\text{Sum of all values in the sample}}{\text{Number of values in the sample}}$, and the mean for a population (denoted by μ) is $\mu = \frac{\text{Sum of all values in the population}}{\text{Number of values in the population}}$.

Median: a number that separates ordered data into halves; half the values are the same number or smaller than the median and half the values are the same number or larger than the median. The median may or may not be part of the data.

Midpoint: the mean of an interval in a frequency table

Mode: the value that appears most frequently in a set of data

Outlier: an observation that does not fit the rest of the data

Paired Data Set: two data sets that have a one to one relationship so that:

• both data sets are the same size, and

• each data point in one data set is matched with exactly one point from the other set.

Percentile: a number that divides ordered data into hundredths; percentiles may or may not be part of the data. The median of the data is the second quartile and the 50th percentile. The first and third quartiles are the 25th and the 75th percentiles, respectively.

Quartiles: the numbers that separate the data into quarters; quartiles may or may not be part of the data. The second quartile is the median of the data.

Relative Frequency: the ratio of the number of times a value of the data occurs in the set of all outcomes to the number of all outcomes

Skewed: used to describe data that is not symmetrical; when the right side of a graph looks “chopped off” compared to the left side, we say it is “skewed to the left.” When the left side of the graph looks “chopped off” compared to the right side, we say the data is “skewed to the right.” Alternatively: when the lower values of the data are more spread out, we say the data are skewed to the left. When the greater values are more spread out, the data are skewed to the right.

Standard Deviation: a number that is equal to the square root of the variance and measures how far data values are from their mean; notation: s for sample standard deviation and σ for population standard deviation.

Variance: mean of the squared deviations from the mean, or the square of the standard deviation; for a set of data, a deviation can be represented as x – x̄ where x is a value of the data and x̄ is the sample mean. The sample variance is equal to the sum of the squares of the deviations divided by the difference of the sample size and one.

CHAPTER REVIEW

2.1 Stem-and-Leaf Graphs (Stemplots), Line Graphs, and Bar Graphs

A stem-and-leaf plot is a way to plot data and look at the distribution. In a stem-and-leaf plot, all data values within a class are visible. The advantage in a stem-and-leaf plot is that all values are listed, unlike a histogram, which gives classes of data values. A line graph is often used to represent a set of data values in which a quantity varies with time. These graphs are useful for finding trends. That is, finding a general pattern in data sets including temperature, sales, employment, company profit or cost over a period of time. A bar graph is a chart that uses either horizontal or vertical bars to show comparisons among categories. One axis of the chart shows the specific categories being compared, and the other axis represents a discrete value. Some bar graphs present bars clustered in groups of more than one (grouped bar graphs), and others show the bars divided into subparts to show cumulative effect (stacked bar graphs). Bar graphs are especially useful when categorical data is being used.

2.2 Histograms, Frequency Polygons, and Time Series Graphs

A histogram is a graphic version of a frequency distribution. The graph consists of bars of equal width drawn adjacent to each other. The horizontal scale represents classes of quantitative data values and the vertical scale represents frequencies. The heights of the bars correspond to frequency values. Histograms are typically used for large, continuous, quantitative data sets. A frequency polygon can also be used when graphing large data sets with data points that repeat. The data usually goes on the x-axis with the frequency being graphed on the y-axis. Time series graphs can be helpful when looking at large amounts of data for one variable over a period of time.

2.3 Measures of the Location of the Data

The values that divide a rank-ordered set of data into 100 equal parts are called percentiles. Percentiles are used to compare and interpret data. For example, an observation at the 50th percentile would be greater than 50 percent of the other observations in the set. Quartiles divide data into quarters. The first quartile (Q1) is the 25th percentile, the second quartile (Q2 or median) is the 50th percentile, and the third quartile (Q3) is the 75th percentile. The interquartile range, or IQR, is the range of the middle 50 percent of the data values. The IQR is found by subtracting Q1 from Q3, and can help determine outliers by using the following two expressions.

• Q3 + IQR(1.5)

• Q1 – IQR(1.5)
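As a brief illustration of the two expressions, the Python sketch below applies them to the quartiles found in Example 2.33 (Q1 = 61, Q3 = 90) and checks the smallest and largest exam scores. The variable names are illustrative; a value below the lower limit or above the upper limit would be flagged as a potential outlier.

```python
# Quartiles and extreme values from Example 2.33 (first exam scores).
q1, q3 = 61, 90
smallest, largest = 33, 100

iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr  # Q1 - IQR(1.5)
upper_limit = q3 + 1.5 * iqr  # Q3 + IQR(1.5)

has_potential_outliers = smallest < lower_limit or largest > upper_limit

print(f"IQR = {iqr}")
print(f"lower limit = {lower_limit}, upper limit = {upper_limit}")
print(f"potential outliers present: {has_potential_outliers}")
```

Only the end values need to be checked: if the smallest and largest values fall inside the limits, every other value does too.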

2.4 Box Plots

Box plots are a type of graph that can help visually organize data. To graph a box plot the following data points must be calculated: the minimum value, the first quartile, the median, the third quartile, and the maximum value. Once the box plot is graphed, you can display and compare distributions of data.

2.5 Measures of the Center of the Data

The mean and the median can be calculated to help you find the “center” of a data set. The mean is the best estimate for the actual data set, but the median is the best measurement when a data set contains several outliers or extreme values. The mode will tell you the most frequently occurring datum (or data) in your data set. The mean, median, and mode are extremely helpful when you need to analyze your data, but if your data set consists of ranges which lack specific values, the mean may seem impossible to calculate. However, the mean can be approximated if you add the lower boundary with the upper boundary and divide by two to find the midpoint of each interval. Multiply each midpoint by the number of values found in the corresponding range. Divide the sum of these values by the total number of data values in the set.

2.6 Skewness and the Mean, Median, and Mode

Looking at the distribution of data can reveal a lot about the relationship between the mean, the median, and the mode. There are three types of distributions. A right (or positive) skewed distribution has a shape like Figure 2.17. A left (or negative) skewed distribution has a shape like Figure 2.18. A symmetrical distribution looks like Figure 2.16.


2.7 Measures of the Spread of the Data

The standard deviation can help you calculate the spread of data. There are different equations to use if you are calculating the standard deviation of a sample or of a population.

• The Standard Deviation allows us to compare individual data or classes to the data set mean numerically.

• $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ or $s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}}$ is the formula for calculating the standard deviation of a sample. To calculate the standard deviation of a population, we would use the population mean, μ, and the formula $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$ or $\sigma = \sqrt{\frac{\sum f(x - \mu)^2}{N}}$.

FORMULA REVIEW

2.3 Measures of the Location of the Data

$i = \left(\frac{k}{100}\right)(n + 1)$

where i = the ranking or position of a data value,

k = the kth percentile,

n = total number of data.

Expression for finding the percentile of a data value: $\left(\frac{x + 0.5y}{n}\right)(100)$

where x = the number of values counting from the bottom of the data list up to but not including the data value for which you want to find the percentile,

y = the number of data values equal to the data value for which you want to find the percentile,

n = total number of data
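The two expressions above translate directly into code. The following Python sketch uses illustrative function names that do not come from the text and applies both formulas to the exam scores from Example 2.33, where the 50th percentile should land on the median, 73.

```python
def percentile_position(k, n):
    """Position i = (k / 100)(n + 1) of the kth percentile among n ordered values."""
    return k / 100 * (n + 1)


def percentile_of_value(ordered_data, value):
    """Percentile of a data value: ((x + 0.5y) / n)(100)."""
    x = sum(1 for d in ordered_data if d < value)   # values below the data value
    y = sum(1 for d in ordered_data if d == value)  # values equal to the data value
    return (x + 0.5 * y) / len(ordered_data) * 100


# First exam scores from Example 2.33, already in order.
scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

i = percentile_position(50, len(scores))
p = percentile_of_value(scores, 73)

print(f"The 50th percentile is at position {i}: {scores[int(i) - 1]}")
print(f"The score 73 is at percentile {p}")
```

In this case the position is a whole number. When it is not, the value has to be found between two neighboring data values, which this sketch does not attempt.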

2.5 Measures of the Center of the Data

$\mu = \frac{\sum fm}{\sum f}$

where f = interval frequencies and m = interval midpoints.

2.7 Measures of the Spread of the Data

$s_x = \sqrt{\frac{\sum fm^2}{n} - \bar{x}^2}$ where

sx = sample standard deviation

x̄ = sample mean

PRACTICE

2.1 Stem-and-Leaf Graphs (Stemplots), Line Graphs, and Bar Graphs

For each of the following data sets, create a stem plot and identify any outliers.

1. The miles per gallon rating for 30 cars are shown below (lowest to highest). 19, 19, 19, 20, 21, 21, 25, 25, 25, 26, 26, 28, 29, 31, 31, 32, 32, 33, 34, 35, 36, 37, 37, 38, 38, 38, 38, 41, 43, 43

2. The height in feet of 25 trees is shown below (lowest to highest). 25, 27, 33, 34, 34, 34, 35, 37, 37, 38, 39, 39, 39, 40, 41, 45, 46, 47, 49, 50, 50, 53, 53, 54, 54

3. The data are the prices of different laptops at an electronics store. Round each value to the nearest ten. 249, 249, 260, 265, 265, 280, 299, 299, 309, 319, 325, 326, 350, 350, 350, 365, 369, 389, 409, 459, 489, 559, 569, 570, 610

4. The data are daily high temperatures in a town for one month. 61, 61, 62, 64, 66, 67, 67, 67, 68, 69, 70, 70, 70, 71, 71, 72, 74, 74, 74, 75, 75, 75, 76, 76, 77, 78, 78, 79, 79, 95

For the next three exercises, use the data to construct a line graph.


5. In a survey, 40 people were asked how many times they visited a store before making a major purchase. The results are shown in Table 2.37.

| Number of times in store | Frequency |
|---|---|
| 1 | 4 |
| 2 | 10 |
| 3 | 16 |
| 4 | 6 |
| 5 | 4 |

Table 2.37

6. In a survey, several people were asked how many years it has been since they purchased a mattress. The results are shown in Table 2.38.

| Years since last purchase | Frequency |
|---|---|
| 0 | 2 |
| 1 | 8 |
| 2 | 13 |
| 3 | 22 |
| 4 | 16 |
| 5 | 9 |

Table 2.38

7. Several children were asked how many TV shows they watch each day. The results of the survey are shown in Table 2.39.

| Number of TV Shows | Frequency |
|---|---|
| 0 | 12 |
| 1 | 18 |
| 2 | 36 |
| 3 | 7 |
| 4 | 2 |

Table 2.39


8. The students in Ms. Ramirez’s math class have birthdays in each of the four seasons. Table 2.40 shows the four seasons, the number of students who have birthdays in each season, and the percentage (%) of students in each group. Construct a bar graph showing the number of students.

| Seasons | Number of students | Proportion of population |
|---|---|---|
| Spring | 8 | 24% |
| Summer | 9 | 26% |
| Autumn | 11 | 32% |
| Winter | 6 | 18% |

Table 2.40

9. Using the data from Ms. Ramirez’s math class supplied in Exercise 2.8, construct a bar graph showing the percentages.

10. David County has six high schools. Each school sent students to participate in a county-wide science competition. Table 2.41 shows the percentage breakdown of competitors from each school, and the percentage of the entire student population of the county that goes to each school. Construct a bar graph that shows the population percentage of competitors from each school.

| High School | Science competition population | Overall student population |
|---|---|---|
| Alabaster | 28.9% | 8.6% |
| Concordia | 7.6% | 23.2% |
| Genoa | 12.1% | 15.0% |
| Mocksville | 18.5% | 14.3% |
| Tynneson | 24.2% | 10.1% |
| West End | 8.7% | 28.8% |

Table 2.41

11. Use the data from the David County science competition supplied in Exercise 2.10. Construct a bar graph that shows the county-wide population percentage of students at each school.

2.2 Histograms, Frequency Polygons, and Time Series Graphs

12. Sixty-five randomly selected car salespersons were asked the number of cars they generally sell in one week. Fourteen people answered that they generally sell three cars; nineteen generally sell four cars; twelve generally sell five cars; nine generally sell six cars; eleven generally sell seven cars. Complete the table.

Data Value (# cars) Frequency Relative Frequency Cumulative Relative Frequency

Table 2.42

13. What does the frequency column in Table 2.42 sum to? Why?

14. What does the relative frequency column in Table 2.42 sum to? Why?


15. What is the difference between relative frequency and frequency for each data value in Table 2.42?

16. What is the difference between cumulative relative frequency and relative frequency for each data value?

17. To construct the histogram for the data in Table 2.42, determine appropriate minimum and maximum x and y values and the scaling. Sketch the histogram. Label the horizontal and vertical axes with words. Include numerical scaling.

Figure 2.31


18. Construct a frequency polygon for the following:

a.

| Pulse Rates for Women | Frequency |
|---|---|
| 60–69 | 12 |
| 70–79 | 14 |
| 80–89 | 11 |
| 90–99 | 1 |
| 100–109 | 1 |
| 110–119 | 0 |
| 120–129 | 1 |

Table 2.43

b.

| Actual Speed in a 30 MPH Zone | Frequency |
|---|---|
| 42–45 | 25 |
| 46–49 | 14 |
| 50–53 | 7 |
| 54–57 | 3 |
| 58–61 | 1 |

Table 2.44

c.

| Tar (mg) in Nonfiltered Cigarettes | Frequency |
|---|---|
| 10–13 | 1 |
| 14–17 | 0 |
| 18–21 | 15 |
| 22–25 | 7 |
| 26–29 | 2 |

Table 2.45


19. Construct a frequency polygon from the frequency distribution for the 50 highest ranked countries for depth of hunger.

| Depth of Hunger | Frequency |
|---|---|
| 230–259 | 21 |
| 260–289 | 13 |
| 290–319 | 5 |
| 320–349 | 7 |
| 350–379 | 1 |
| 380–409 | 1 |
| 410–439 | 1 |

Table 2.46

20. Use the two frequency tables to compare the life expectancy of men and women from 20 randomly selected countries. Include an overlayed frequency polygon and discuss the shapes of the distributions, the center, the spread, and any outliers. What can we conclude about the life expectancy of women compared to men?

| Life Expectancy at Birth – Women | Frequency |
|---|---|
| 49–55 | 3 |
| 56–62 | 3 |
| 63–69 | 1 |
| 70–76 | 3 |
| 77–83 | 8 |
| 84–90 | 2 |

Table 2.47

| Life Expectancy at Birth – Men | Frequency |
|---|---|
| 49–55 | 3 |
| 56–62 | 3 |
| 63–69 | 1 |
| 70–76 | 1 |
| 77–83 | 7 |
| 84–90 | 5 |

Table 2.48


21. Construct a time series graph for (a) the number of male births, (b) the number of female births, and (c) the total number of births.

| Sex/Year | 1855 | 1856 | 1857 | 1858 | 1859 | 1860 | 1861 |
|---|---|---|---|---|---|---|---|
| Female | 45,545 | 49,582 | 50,257 | 50,324 | 51,915 | 51,220 | 52,403 |
| Male | 47,804 | 52,239 | 53,158 | 53,694 | 54,628 | 54,409 | 54,606 |
| Total | 93,349 | 101,821 | 103,415 | 104,018 | 106,543 | 105,629 | 107,009 |

Table 2.49

| Sex/Year | 1862 | 1863 | 1864 | 1865 | 1866 | 1867 | 1868 | 1869 |
|---|---|---|---|---|---|---|---|---|
| Female | 51,812 | 53,115 | 54,959 | 54,850 | 55,307 | 55,527 | 56,292 | 55,033 |
| Male | 55,257 | 56,226 | 57,374 | 58,220 | 58,360 | 58,517 | 59,222 | 58,321 |
| Total | 107,069 | 109,341 | 112,333 | 113,070 | 113,667 | 114,044 | 115,514 | 113,354 |

Table 2.50

| Sex/Year | 1870 | 1871 | 1872 | 1873 | 1874 | 1875 |
|---|---|---|---|---|---|---|
| Female | 56,431 | 56,099 | 57,472 | 58,233 | 60,109 | 60,146 |
| Male | 58,959 | 60,029 | 61,293 | 61,467 | 63,602 | 63,432 |
| Total | 115,390 | 116,128 | 118,765 | 119,700 | 123,711 | 123,578 |

Table 2.51

22. The following data sets list full-time police per 100,000 citizens along with homicides per 100,000 citizens for the city of Detroit, Michigan, during the period from 1961 to 1973.

Year 1961 1962 1963 1964 1965 1966 1967

Police 260.35 269.8 272.04 272.96 272.51 261.34 268.89

Homicides 8.6 8.9 8.52 8.89 13.07 14.57 21.36

Table 2.52

Year 1968 1969 1970 1971 1972 1973

Police 295.99 319.87 341.43 356.59 376.69 390.19

Homicides 28.03 31.49 37.39 46.26 47.24 52.33

Table 2.53

a. Construct a double time series graph using a common x-axis for both sets of data.

b. Which variable increased the fastest? Explain.

c. Did Detroit’s increase in police officers have an impact on the murder rate? Explain.
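
One possible way to compare how fast the two variables grew is the percent change from 1961 to 1973. The sketch below assumes that measure (the exercise does not prescribe one) and uses only the first and last values of Tables 2.52 and 2.53; `percent_change` is a hypothetical helper name.

```python
# First-year and last-year values copied from Tables 2.52 and 2.53.
police_1961, police_1973 = 260.35, 390.19
homicides_1961, homicides_1973 = 8.6, 52.33

def percent_change(start, end):
    """Relative change from start to end, expressed as a percentage."""
    return (end - start) / start * 100

police_change = percent_change(police_1961, police_1973)
homicide_change = percent_change(homicides_1961, homicides_1973)
print(f"police: {police_change:.1f}%  homicides: {homicide_change:.1f}%")
```

A percent change describes growth only; it does not by itself show that one variable caused the other to change.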


2.3 Measures of the Location of the Data

23. Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.

18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77

a. Find the 40th percentile. b. Find the 78th percentile.
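
The kth percentile of ordered data can be located with the index i = (k/100)(n + 1): if i is a whole number, take the value in position i; otherwise average the values in the positions just below and just above i. A minimal Python sketch of that rule, assuming the 29 ages listed above (the function name `kth_percentile` is hypothetical):

```python
import math

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

def kth_percentile(sorted_data, k):
    """Locate the k-th percentile with the index i = (k / 100) * (n + 1)."""
    n = len(sorted_data)
    i = k * (n + 1) / 100                      # 1-based position in the ordered list
    lo, hi = math.floor(i), math.ceil(i)
    # Clamp both positions so extreme k cannot fall off either end of the list.
    lo = min(max(lo, 1), n)
    hi = min(max(hi, 1), n)
    # When i is a whole number, lo == hi and the average is just that value.
    return (sorted_data[lo - 1] + sorted_data[hi - 1]) / 2

p40 = kth_percentile(ages, 40)
p78 = kth_percentile(ages, 78)
print(p40, p78)
```

Statistical software often uses other interpolation rules, so its percentiles may differ slightly from this hand method.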

24. Listed are 32 ages for Academy Award winning best actors in order from smallest to largest.

18; 18; 21; 22; 25; 26; 27; 29; 30; 31; 31; 33; 36; 37; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77

a. Find the percentile of 37. b. Find the percentile of 72.
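
The percentile of a particular data value can be computed as (x + 0.5y)/n × 100, where x counts the values below it, y counts the values equal to it, and n is the number of data values; the result is rounded to the nearest integer. A short Python sketch under those definitions, using the 32 ages above (`percentile_of` is a hypothetical name):

```python
ages = [18, 18, 21, 22, 25, 26, 27, 29, 30, 31, 31, 33, 36, 37, 37, 41,
        42, 47, 52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

def percentile_of(data, value):
    """Percentile rank (x + 0.5 * y) / n * 100, rounded to the nearest integer."""
    x = sum(1 for v in data if v < value)    # values strictly below `value`
    y = sum(1 for v in data if v == value)   # values equal to `value`
    return round((x + 0.5 * y) / len(data) * 100)

rank_37 = percentile_of(ages, 37)
rank_72 = percentile_of(ages, 72)
print(rank_37, rank_72)
```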

25. Jesse was ranked 37th in his graduating class of 180 students. At what percentile is Jesse’s ranking?

26. a. For runners in a race, a low time means a faster run. The winners in a race have the shortest running times. Is it more desirable to have a finish time with a high or a low percentile when running a race?

b. The 20th percentile of run times in a particular race is 5.2 minutes. Write a sentence interpreting the 20th percentile in the context of the situation.

c. A bicyclist in the 90th percentile of a bicycle race completed the race in 1 hour and 12 minutes. Is he among the fastest or slowest cyclists in the race? Write a sentence interpreting the 90th percentile in the context of the situation.

27. a. For runners in a race, a higher speed means a faster run. Is it more desirable to have a speed with a high or a low percentile when running a race?

b. The 40th percentile of speeds in a particular race is 7.5 miles per hour. Write a sentence interpreting the 40th percentile in the context of the situation.

28. On an exam, would it be more desirable to earn a grade with a high or low percentile? Explain.

29. Mina is waiting in line at the Department of Motor Vehicles (DMV). Her wait time of 32 minutes is the 85th percentile of wait times. Is that good or bad? Write a sentence interpreting the 85th percentile in the context of this situation.

30. In a survey collecting data about the salaries earned by recent college graduates, Li found that her salary was in the 78th percentile. Should Li be pleased or upset by this result? Explain.

31. In a study collecting data about the repair costs of damage to automobiles in a certain type of crash test, a certain model of car had $1,700 in damage and was in the 90th percentile. Should the manufacturer and the consumer be pleased or upset by this result? Explain and write a sentence that interprets the 90th percentile in the context of this problem.

32. The University of California has two criteria used to set admission standards for freshmen to be admitted to a college in the UC system:

a. Students’ GPAs and scores on standardized tests (SATs and ACTs) are entered into a formula that calculates an “admissions index” score. The admissions index score is used to set eligibility standards intended to meet the goal of admitting the top 12% of high school students in the state. In this context, what percentile does the top 12% represent?

b. Students whose GPAs are at or above the 96th percentile of all students at their high school are eligible (called eligible in the local context), even if they are not in the top 12% of all students in the state. What percentage of students from each high school are “eligible in the local context”?

33. Suppose that you are buying a house. You and your realtor have determined that the most expensive house you can afford is the 34th percentile. The 34th percentile of housing prices is $240,000 in the town you want to move to. In this town, can you afford 34% of the houses or 66% of the houses?

Use the following information to answer the next six exercises. Sixty-five randomly selected car salespersons were asked the number of cars they generally sell in one week. Fourteen people answered that they generally sell three cars; nineteen generally sell four cars; twelve generally sell five cars; nine generally sell six cars; eleven generally sell seven cars.

34. First quartile = _______

35. Second quartile = median = 50th percentile = _______

36. Third quartile = _______


37. Interquartile range (IQR) = _____ – _____ = _____

38. 10th percentile = _______

39. 70th percentile = _______
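
For data given as counts per value, one approach is to expand the counts into an ordered list and then take the median of the whole list and of its lower and upper halves. The sketch below assumes that median-of-halves convention (with the middle value left out of both halves when n is odd) and uses the car-sales counts stated above; `quartiles` is a hypothetical helper.

```python
from statistics import median

# Counts from the exercise text: cars sold per week -> number of salespersons.
freq = {3: 14, 4: 19, 5: 12, 6: 9, 7: 11}
data = sorted(v for v, f in freq.items() for _ in range(f))   # 65 ordered values

def quartiles(sorted_data):
    """Q1, median, Q3 by the median-of-halves rule (assumes at least two values)."""
    half = len(sorted_data) // 2
    lower = sorted_data[:half]     # values before the middle position
    upper = sorted_data[-half:]    # values after the middle position
    return median(lower), median(sorted_data), median(upper)

q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Calculators and software packages use several different quartile conventions, so their results can differ from this one on other data sets.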

2.4 Box Plots

Use the following information to answer the next two exercises. Sixty-five randomly selected car salespersons were asked the number of cars they generally sell in one week. Fourteen people answered that they generally sell three cars; nineteen generally sell four cars; twelve generally sell five cars; nine generally sell six cars; eleven generally sell seven cars.

40. Construct a box plot below. Use a ruler to measure and scale accurately.

41. Looking at your box plot, does it appear that the data are concentrated together, spread out evenly, or concentrated in some areas, but not in others? How can you tell?

2.5 Measures of the Center of the Data

42. Find the mean for the following frequency tables.

a. Grade Frequency

49.5–59.5 2

59.5–69.5 3

69.5–79.5 8

79.5–89.5 12

89.5–99.5 5

Table 2.54

b. Daily Low Temperature Frequency

49.5–59.5 53

59.5–69.5 32

69.5–79.5 15

79.5–89.5 1

89.5–99.5 0

Table 2.55

c. Points per Game Frequency

49.5–59.5 14

59.5–69.5 32

69.5–79.5 15

79.5–89.5 23

89.5–99.5 2

Table 2.56
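
The mean of a grouped frequency table is usually estimated as Σ(f·m)/Σf, where m is each class midpoint and f its frequency. A minimal Python sketch of that estimate, assuming the grade data of Table 2.54 (`grouped_mean` is a hypothetical name):

```python
# Class boundaries and frequencies copied from Table 2.54 (grades).
classes = [(49.5, 59.5), (59.5, 69.5), (69.5, 79.5), (79.5, 89.5), (89.5, 99.5)]
freqs = [2, 3, 8, 12, 5]

def grouped_mean(classes, freqs):
    """Estimate the mean as sum(f * m) / sum(f), with m the class midpoint."""
    mids = [(lo + hi) / 2 for lo, hi in classes]
    return sum(f * m for f, m in zip(freqs, mids)) / sum(freqs)

mean_grade = grouped_mean(classes, freqs)
print(mean_grade)
```

Because every value in a class is treated as if it sat at the midpoint, the result is an estimate rather than the exact mean of the raw data.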

Use the following information to answer the next three exercises: The following data show the lengths of boats moored in a marina. The data are ordered from smallest to largest: 16; 17; 19; 20; 20; 21; 23; 24; 25; 25; 25; 26; 26; 27; 27; 27; 28; 29; 30; 32; 33; 33; 34; 35; 37; 39; 40


43. Calculate the mean.

44. Identify the median.

45. Identify the mode.
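
Answers for raw data like the boat lengths can be checked with Python's standard `statistics` module. This is only a sketch for checking work, and it assumes Python 3.8 or later for `multimode`, which returns every value tied for the highest count.

```python
from statistics import mean, median, multimode

boats = [16, 17, 19, 20, 20, 21, 23, 24, 25, 25, 25, 26, 26, 27, 27, 27,
         28, 29, 30, 32, 33, 33, 34, 35, 37, 39, 40]

boat_mean = mean(boats)
boat_median = median(boats)
boat_modes = multimode(boats)   # list of all values sharing the highest frequency
print(round(boat_mean, 2), boat_median, boat_modes)
```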

Use the following information to answer the next three exercises: Sixty-five randomly selected car salespersons were asked the number of cars they generally sell in one week. Fourteen people answered that they generally sell three cars; nineteen generally sell four cars; twelve generally sell five cars; nine generally sell six cars; eleven generally sell seven cars. Calculate the following:

46. sample mean = x̄ = _______

47. median = _______

48. mode = _______

2.6 Skewness and the Mean, Median, and Mode

Use the following information to answer the next three exercises: State whether the data are symmetrical, skewed to the left, or skewed to the right.

49. 1; 1; 1; 2; 2; 2; 2; 3; 3; 3; 3; 3; 3; 3; 3; 4; 4; 4; 5; 5

50. 16; 17; 19; 22; 22; 22; 22; 22; 23

51. 87; 87; 87; 87; 87; 88; 89; 89; 90; 91

52. When the data are skewed left, what is the typical relationship between the mean and median?

53. When the data are symmetrical, what is the typical relationship between the mean and median?

54. What word describes a distribution that has two modes?

55. Describe the shape of this distribution.

Figure 2.32


56. Describe the relationship between the mode and the median of this distribution.

Figure 2.33

57. Describe the relationship between the mean and the median of this distribution.

Figure 2.34

58. Describe the shape of this distribution.

Figure 2.35


59. Describe the relationship between the mode and the median of this distribution.

Figure 2.36

60. Are the mean and the median exactly the same in this distribution? Why or why not?

Figure 2.37

61. Describe the shape of this distribution.

Figure 2.38


62. Describe the relationship between the mode and the median of this distribution.

Figure 2.39

63. Describe the relationship between the mean and the median of this distribution.

Figure 2.40

64. The mean and median for the data are the same.

3; 4; 5; 5; 6; 6; 6; 6; 7; 7; 7; 7; 7; 7; 7

Are the data perfectly symmetrical? Why or why not?

65. Which is the greatest, the mean, the mode, or the median of the data set?

11; 11; 12; 12; 12; 12; 13; 15; 17; 22; 22; 22

66. Which is the least, the mean, the mode, or the median of the data set?

56; 56; 56; 58; 59; 60; 62; 64; 64; 65; 67

67. Of the three measures, which tends to reflect skewing the most, the mean, the mode, or the median? Why?

68. In a perfectly symmetrical distribution, when would the mode be different from the mean and median?

2.7 Measures of the Spread of the Data

Use the following information to answer the next two exercises: The following data are the distances between 20 retail stores and a large distribution center. The distances are in miles.

29; 37; 38; 40; 58; 67; 68; 69; 76; 86; 87; 95; 96; 96; 99; 106; 112; 127; 145; 150

69. Use a graphing calculator or computer to find the standard deviation and round to the nearest tenth.


70. Find the value that is one standard deviation below the mean.
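
Where the exercise calls for a graphing calculator or computer, Python's `statistics` module is one option. The sketch below assumes the 20 distances are treated as a sample, so it uses `stdev`, which divides by n − 1; `pstdev` would be the population version.

```python
from statistics import mean, stdev

distances = [29, 37, 38, 40, 58, 67, 68, 69, 76, 86, 87, 95, 96, 96,
             99, 106, 112, 127, 145, 150]

x_bar = mean(distances)
s = stdev(distances)            # sample standard deviation (divides by n - 1)
one_sd_below = x_bar - s        # the value one standard deviation below the mean
print(round(x_bar, 2), round(s, 1), round(one_sd_below, 2))
```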

71. Two baseball players, Fredo and Karl, on different teams wanted to find out who had the higher batting average when compared to his team. Which baseball player had the higher batting average when compared to his team?

Baseball Player Batting Average Team Batting Average Team Standard Deviation

Fredo 0.158 0.166 0.012

Karl 0.177 0.189 0.015

Table 2.57
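
Values from different groups can be compared by how many standard deviations each lies from its own group mean, z = (value − mean)/standard deviation. A minimal Python sketch of that comparison for Table 2.57 (the names `players` and `z_score` are hypothetical):

```python
# Rows copied from Table 2.57: (player average, team average, team standard deviation).
players = {
    "Fredo": (0.158, 0.166, 0.012),
    "Karl": (0.177, 0.189, 0.015),
}

def z_score(value, mean, sd):
    """Number of standard deviations that `value` lies above (+) or below (-) `mean`."""
    return (value - mean) / sd

z = {name: z_score(*row) for name, row in players.items()}
for name, score in z.items():
    print(name, round(score, 2))
```

The player with the larger (less negative) z-score has the higher average relative to his own team.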

72. Use Table 2.57 to find the value that is three standard deviations: a. above the mean b. below the mean


73. Find the standard deviation for the following frequency tables using the formula. Check the calculations with the TI 83/84.

a. Grade Frequency

49.5–59.5 2

59.5–69.5 3

69.5–79.5 8

79.5–89.5 12

89.5–99.5 5

Table 2.58

b. Daily Low Temperature Frequency

49.5–59.5 53

59.5–69.5 32

69.5–79.5 15

79.5–89.5 1

89.5–99.5 0

Table 2.59

c. Points per Game Frequency

49.5–59.5 14

59.5–69.5 32

69.5–79.5 15

79.5–89.5 23

89.5–99.5 2

Table 2.60
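
For a grouped frequency table, the sample standard deviation is commonly estimated as s = √(Σ f(m − x̄)² / (n − 1)), with m the class midpoint, f the frequency, and n = Σf. The following Python sketch applies that formula to Table 2.58, under the assumption that the data are a sample; `grouped_sample_sd` is a hypothetical name.

```python
import math

# Class boundaries and frequencies copied from Table 2.58 (grades).
classes = [(49.5, 59.5), (59.5, 69.5), (69.5, 79.5), (79.5, 89.5), (89.5, 99.5)]
freqs = [2, 3, 8, 12, 5]

def grouped_sample_sd(classes, freqs):
    """Estimate s from class midpoints: sqrt(sum(f * (m - mean)^2) / (n - 1))."""
    mids = [(lo + hi) / 2 for lo, hi in classes]
    n = sum(freqs)
    x_bar = sum(f * m for f, m in zip(freqs, mids)) / n
    squared_dev = sum(f * (m - x_bar) ** 2 for f, m in zip(freqs, mids))
    return math.sqrt(squared_dev / (n - 1))

s_grades = grouped_sample_sd(classes, freqs)
print(round(s_grades, 2))
```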

HOMEWORK

2.1 Stem-and-Leaf Graphs (Stemplots), Line Graphs, and Bar Graphs

74. Student grades on a chemistry exam were: 77, 78, 76, 81, 86, 51, 79, 82, 84, 99

a. Construct a stem-and-leaf plot of the data.

b. Are there any potential outliers? If so, which scores are they? Why do you consider them outliers?
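
A stem-and-leaf plot for two-digit scores splits each value into a tens digit (stem) and a ones digit (leaf). The sketch below is one possible way to build it in Python; it assumes two-digit data, and `stem_and_leaf` is a hypothetical helper.

```python
from collections import defaultdict

scores = [77, 78, 76, 81, 86, 51, 79, 82, 84, 99]

def stem_and_leaf(values):
    """Map each tens digit (stem) to its sorted ones digits (leaves)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    # Keep empty stems so gaps in the data remain visible in the plot.
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}

plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Keeping the empty stem makes any gap between a low score and the rest of the data easy to see.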


75. Table 2.61 contains the 2010 obesity rates in U.S. states and Washington, DC.

State Percent (%) State Percent (%) State Percent (%)

Alabama 32.2 Kentucky 31.3 North Dakota 27.2

Alaska 24.5 Louisiana 31.0 Ohio 29.2

Arizona 24.3 Maine 26.8 Oklahoma 30.4

Arkansas 30.1 Maryland 27.1 Oregon 26.8

California 24.0 Massachusetts 23.0 Pennsylvania 28.6

Colorado 21.0 Michigan 30.9 Rhode Island 25.5

Connecticut 22.5 Minnesota 24.8 South Carolina 31.5

Delaware 28.0 Mississippi 34.0 South Dakota 27.3

Washington, DC 22.2 Missouri 30.5 Tennessee 30.8

Florida 26.6 Montana 23.0 Texas 31.0

Georgia 29.6 Nebraska 26.9 Utah 22.5

Hawaii 22.7 Nevada 22.4 Vermont 23.2

Idaho 26.5 New Hampshire 25.0 Virginia 26.0

Illinois 28.2 New Jersey 23.8 Washington 25.5

Indiana 29.6 New Mexico 25.1 West Virginia 32.5

Iowa 28.4 New York 23.9 Wisconsin 26.3

Kansas 29.4 North Carolina 27.8 Wyoming 25.1

Table 2.61

a. Use a random number generator to randomly pick eight states. Construct a bar graph of the obesity rates of those eight states.

b. Construct a bar graph for all the states beginning with the letter “A.”

c. Construct a bar graph for all the states beginning with the letter “M.”


2.2 Histograms, Frequency Polygons, and Time Series Graphs


76. Suppose that three book publishers were interested in the number of fiction paperbacks adult consumers purchase per month. Each publisher conducted a survey. In the survey, adult consumers were asked the number of fiction paperbacks they had purchased the previous month. The results are as follows:

# of books Freq. Rel. Freq.

0 10

1 12

2 16

3 12

4 8

5 6

6 2

8 2

Table 2.62 Publisher A

# of books Freq. Rel. Freq.

0 18

1 24

2 24

3 22

4 15

5 10

7 5

9 1

Table 2.63 Publisher B

# of books Freq. Rel. Freq.

0–1 20

2–3 35

4–5 12

6–7 2

8–9 1

Table 2.64 Publisher C


a. Find the relative frequencies for each survey. Write them in the charts.

b. Using a graphing calculator, a computer, or by hand, use the frequency column to construct a histogram for each publisher’s survey. For Publishers A and B, make bar widths of one. For Publisher C, make bar widths of two.

c. In complete sentences, give two reasons why the graphs for Publishers A and B are not identical.

d. Would you have expected the graph for Publisher C to look like the other two graphs? Why or why not?

e. Make new histograms for Publisher A and Publisher B. This time, make bar widths of two.

f. Now, compare the graph for Publisher C to the new graphs for Publishers A and B. Are the graphs more similar or more different? Explain your answer.


77. Often, cruise ships conduct all on-board transactions, with the exception of gambling, on a cashless basis. At the end of the cruise, guests pay one bill that covers all onboard transactions. Suppose that 60 single travelers and 70 couples were surveyed as to their on-board bills for a seven-day cruise from Los Angeles to the Mexican Riviera. Following is a summary of the bills for each group.

Amount($) Frequency Rel. Frequency

51–100 5

101–150 10

151–200 15

201–250 15

251–300 10

301–350 5

Table 2.65 Singles

Amount($) Frequency Rel. Frequency

100–150 5

201–250 5

251–300 5

301–350 5

351–400 10

401–450 10

451–500 10

501–550 10

551–600 5

601–650 5

Table 2.66 Couples

a. Fill in the relative frequency for each group.

b. Construct a histogram for the singles group. Scale the x-axis by $50 widths. Use relative frequency on the y-axis.

c. Construct a histogram for the couples group. Scale the x-axis by $50 widths. Use relative frequency on the y-axis.

d. Compare the two graphs:

i. List two similarities between the graphs.

ii. List two differences between the graphs.

iii. Overall, are the graphs more similar or different?

e. Construct a new graph for the couples by hand. Since each couple is paying for two individuals, instead of scaling the x-axis by $50, scale it by $100. Use relative frequency on the y-axis.

f. Compare the graph for the singles with the new graph for the couples:

i. List two similarities between the graphs.

ii. Overall, are the graphs more similar or different?

g. How did scaling the couples graph differently change the way you compared it to the singles graph?

h. Based on the graphs, do you think that individuals spend the same amount, more or less, as singles as they do person by person as a couple? Explain why in one or two complete sentences.


78. Twenty-five randomly selected students were asked the number of movies they watched the previous week. The results are as follows.

# of movies Frequency Relative Frequency Cumulative Relative Frequency

0 5

1 9

2 6

3 4

4 1

Table 2.67

a. Construct a histogram of the data.

b. Complete the columns of the chart.
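
A relative frequency is a frequency divided by the total number of observations, and a cumulative relative frequency is the running total of those shares. A short Python sketch for the two empty columns of Table 2.67, offered as one way to check hand calculations:

```python
from itertools import accumulate

# Frequencies copied from Table 2.67 for 0, 1, 2, 3, and 4 movies.
freqs = [5, 9, 6, 4, 1]
total = sum(freqs)

rel = [f / total for f in freqs]
# Accumulate the integer counts first, then divide, to avoid rounding drift.
cum = [c / total for c in accumulate(freqs)]

for movies, (r, c) in enumerate(zip(rel, cum)):
    print(movies, round(r, 2), round(c, 2))
```

The last cumulative relative frequency should always equal 1, which is a quick check on the column.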

Use the following information to answer the next two exercises: Suppose one hundred eleven people who shopped in a special t-shirt store were asked the number of t-shirts they own costing more than $19 each.

79. The percentage of people who own at most three t-shirts costing more than $19 each is approximately: a. 21 b. 59 c. 41 d. Cannot be determined

80. If the data were collected by asking the first 111 people who entered the store, then the type of sampling is: a. cluster b. simple random c. stratified d. convenience


81. Following are the 2010 obesity rates by U.S. states and Washington, DC.

State Percent (%) State Percent (%) State Percent (%)

Alabama 32.2 Kentucky 31.3 North Dakota 27.2

Alaska 24.5 Louisiana 31.0 Ohio 29.2

Arizona 24.3 Maine 26.8 Oklahoma 30.4

Arkansas 30.1 Maryland 27.1 Oregon 26.8

California 24.0 Massachusetts 23.0 Pennsylvania 28.6

Colorado 21.0 Michigan 30.9 Rhode Island 25.5

Connecticut 22.5 Minnesota 24.8 South Carolina 31.5

Delaware 28.0 Mississippi 34.0 South Dakota 27.3

Washington, DC 22.2 Missouri 30.5 Tennessee 30.8

Florida 26.6 Montana 23.0 Texas 31.0

Georgia 29.6 Nebraska 26.9 Utah 22.5

Hawaii 22.7 Nevada 22.4 Vermont 23.2

Idaho 26.5 New Hampshire 25.0 Virginia 26.0

Illinois 28.2 New Jersey 23.8 Washington 25.5

Indiana 29.6 New Mexico 25.1 West Virginia 32.5

Iowa 28.4 New York 23.9 Wisconsin 26.3

Kansas 29.4 North Carolina 27.8 Wyoming 25.1

Table 2.68

Construct a bar graph of obesity rates of your state and the four states closest to your state. Hint: Label the x-axis with the states.

2.3 Measures of the Location of the Data

82. The median age for U.S. blacks currently is 30.9 years; for U.S. whites it is 42.3 years.

a. Based upon this information, give two reasons why the black median age could be lower than the white median age.

b. Does the lower median age for blacks necessarily mean that blacks die younger than whites? Why or why not?

c. How might it be possible for blacks and whites to die at approximately the same age, but for the median age for whites to be higher?


83. Six hundred adult Americans were asked by telephone poll, “What do you think constitutes a middle-class income?” The results are in Table 2.69. Also, include left endpoint, but not the right endpoint.

Salary ($) Relative Frequency

< 20,000 0.02

20,000–25,000 0.09

25,000–30,000 0.19

30,000–40,000 0.26

40,000–50,000 0.18

50,000–75,000 0.17

75,000–99,999 0.02

100,000+ 0.01

Table 2.69

a. What percentage of the survey answered “not sure”?

b. What percentage think that middle-class is from $25,000 to $50,000?

c. Construct a histogram of the data.

i. Should all bars have the same width, based on the data? Why or why not?

ii. How should the <20,000 and the 100,000+ intervals be handled? Why?

d. Find the 40th and 80th percentiles.

e. Construct a bar graph of the data.
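
Relative frequencies over all possible responses must sum to 1, so any shortfall in a table points to responses that are not listed. The sketch below sums the column of Table 2.69; reading the remainder as the share that answered “not sure” is an assumption suggested by part (a), not something the table states.

```python
# Relative frequencies copied from Table 2.69, in table order.
rel_freqs = [0.02, 0.09, 0.19, 0.26, 0.18, 0.17, 0.02, 0.01]

accounted = round(sum(rel_freqs), 2)
unaccounted = round(1 - accounted, 2)          # share not covered by any salary class
# Classes 25,000-30,000, 30,000-40,000, and 40,000-50,000 are entries 2 through 4.
middle_share = round(sum(rel_freqs[2:5]), 2)

print(accounted, unaccounted, middle_share)
```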

84. Given the following box plot:

Figure 2.41

a. which quarter has the smallest spread of data? What is that spread?

b. which quarter has the largest spread of data? What is that spread?

c. find the interquartile range (IQR).

d. are there more data in the interval 5–10 or in the interval 10–13? How do you know this?

e. which interval has the fewest data in it? How do you know this?

i. 0–2

ii. 2–4

iii. 10–12

iv. 12–13

v. need more information


85. The following box plot shows the U.S. population for 1990, the latest available year.

Figure 2.42

a. Are there fewer or more children (age 17 and under) than senior citizens (age 65 and over)? How do you know?

b. 12.6% are age 65 and over. Approximately what percentage of the population are working age adults (above age 17 to age 65)?

2.4 Box Plots

86. In a survey of 20-year-olds in China, Germany, and the United States, people were asked the number of foreign countries they had visited in their lifetime. The following box plots display the results.

Figure 2.43

a. In complete sentences, describe what the shape of each box plot implies about the distribution of the data collected.

b. Have more Americans or more Germans surveyed been to over eight foreign countries?

c. Compare the three box plots. What do they imply about the foreign travel of 20-year-old residents of the three countries when compared to each other?

87. Given the following box plot, answer the questions.

Figure 2.44

a. Think of an example (in words) where the data might fit into the above box plot. In 2–5 sentences, write down the example.

b. What does it mean to have the first and second quartiles so close together, while the second to third quartiles are far apart?


88. Given the following box plots, answer the questions.

Figure 2.45

a. In complete sentences, explain why each statement is false.

i. Data 1 has more data values above two than Data 2 has above two.

ii. The data sets cannot have the same mode.

iii. For Data 1, there are more data values below four than there are above four.

b. For which group, Data 1 or Data 2, is the value of “7” more likely to be an outlier? Explain why in complete sentences.


89. A survey was conducted of 130 purchasers of new BMW 3 series cars, 130 purchasers of new BMW 5 series cars, and 130 purchasers of new BMW 7 series cars. In it, people were asked the age they were when they purchased their car. The following box plots display the results.

Figure 2.46

a. In complete sentences, describe what the shape of each box plot implies about the distribution of the data collected for that car series.

b. Which group is most likely to have an outlier? Explain how you determined that.

c. Compare the three box plots. What do they imply about the age of purchasing a BMW from the series when compared to each other?

d. Look at the BMW 5 series. Which quarter has the smallest spread of data? What is the spread?

e. Look at the BMW 5 series. Which quarter has the largest spread of data? What is the spread?

f. Look at the BMW 5 series. Estimate the interquartile range (IQR).

g. Look at the BMW 5 series. Are there more data in the interval 31 to 38 or in the interval 45 to 55? How do you know this?

h. Look at the BMW 5 series. Which interval has the fewest data in it? How do you know this?

i. 31–35

ii. 38–41

iii. 41–64

90. Twenty-five randomly selected students were asked the number of movies they watched the previous week. The results are as follows:

# of movies Frequency

0 5

1 9

2 6

3 4

4 1

Table 2.70

Construct a box plot of the data.


2.5 Measures of the Center of the Data

91. The most obese countries in the world have obesity rates that range from 11.4% to 74.6%. This data is summarized in the following table.

Percent of Population Obese Number of Countries

11.4–20.45 29

20.45–29.45 13

29.45–38.45 4

38.45–47.45 0

47.45–56.45 2

56.45–65.45 1

65.45–74.45 0

74.45–83.45 1

Table 2.71

a. What is the best estimate of the average obesity percentage for these countries?

b. The United States has an average obesity rate of 33.9%. Is this rate above average or below?

c. How does the United States compare to other countries?

92. Table 2.72 gives the percent of children under five considered to be underweight. What is the best estimate for the mean percentage of underweight children?

Percent of Underweight Children Number of Countries

16–21.45 23

21.45–26.9 4

26.9–32.35 9

32.35–37.8 7

37.8–43.25 6

43.25–48.7 1

Table 2.72

2.6 Skewness and the Mean, Median, and Mode

93. The median age of the U.S. population in 1980 was 30.0 years. In 1991, the median age was 33.1 years.

a. What does it mean for the median age to rise?

b. Give two reasons why the median age could rise.

c. For the median age to rise, is the actual number of children less in 1991 than it was in 1980? Why or why not?

2.7 Measures of the Spread of the Data

Use the following information to answer the next nine exercises: The population parameters below describe the full-time equivalent number of students (FTES) each year at Lake Tahoe Community College from 1976–1977 through 2004–2005.

• μ = 1,000 FTES

• median = 1,014 FTES

• σ = 474 FTES


• first quartile = 528.5 FTES

• third quartile = 1,447.5 FTES

• n = 29 years

 
